add directory study

This commit is contained in:
gohigh
2024-02-19 00:25:23 -05:00
parent b1306b38b1
commit f3774e2f8c
4001 changed files with 2285787 additions and 0 deletions

File diff suppressed because it is too large

Binary file not shown.

Binary file not shown.

View File

@@ -0,0 +1,73 @@
<html><head>
<title>HPFS: Application Programs and the HPFS</title>
</head>
<body>
<center>
<h1>Application Programs and the HPFS</h1>
</center>
Each of the OS/2 releases thus far has carried with it a major
discontinuity for application programmers who learned their trade in the
MS-DOS environment. In OS/2 1.0, such programmers were faced for the first
time with virtual memory, multitasking, interprocess communication, and
the protected mode restrictions on addressing and direct control of the
hardware, and were challenged to master powerful new concepts such as
threading and dynamic linking. In OS/2 Version 1.1, the stakes were raised
even further. Programmers were offered a powerful hardware-independent
graphical user interface but had to restructure their applications
drastically for an event-driven environment based on objects and message
passing. In OS/2 Version 1.2, it is time for many of the file-oriented
programming habits and assumptions carried forward from the MS-DOS
environment to fall by the wayside. An application that wishes to take
full advantage of the HPFS must allow for long, free-form, mixed-case
filenames and paths with few restrictions on punctuation and must be
sensitive to the presence of EAs and ACLs.
<p>
After all, if EAs are to be of any use, it won't suffice for applications
to update a file by renaming the old file and creating a new one without
also copying the EAs. But the necessary changes for OS/2 Version 1.2 are
not tricky to make. A new API function, DosCopy, helps applications create
backups--it essentially duplicates an existing file together with its EAs.
EAs can also be manipulated explicitly with DosQFileInfo, DosSetFileInfo,
DosQPathInfo, and DosSetPathInfo. A program should call DosQSysInfo at run
time to find the maximum possible path length for the system and ensure
that all buffers used by DosChDir, DosQCurDir, and related functions are
sufficiently large. Similarly, the buffers used by DosOpen, DosMove,
DosGetModName, DosFindFirst, DosFindNext, and like functions must allow for
longer filenames. Any logic that folds cases in filenames or tests for the
occurrence of only one dot delimiter in a filename must be rethought or
eliminated. The other changes in the API will not affect the average
application. The functions DosQFileInfo, DosFindFirst, and DosFindNext now
return all three sets of times and dates (created, last accessed, last
modified) for a file on an HPFS volume, but few programs are concerned with
time and date stamps anyway. DosQFsInfo is used to obtain volume labels or
disk characteristics just as before, and the use of DosSetFsInfo for volume
labels is unchanged. There are a few totally new API functions, such as
DosFsCtl (analogous to DosDevIOCtl but used for communication between an
application and an FSD), DosFsAttach (a sort of explicit mount call), and
DosQFsAttach (determines which FSD owns a volume); these are intended mainly
for use by disk utility programs. In order to prevent old OS/2 applications
and MS-DOS applications running in the DOS box from inadvertently damaging
HPFS files, a new flag bit has been defined in the EXE file header that
indicates whether an application is HPFS-aware. If this bit is not set, the
application will only be able to search for, open, or create files on HPFS
volumes that are compatible with the FAT file system's 8.3 naming conventions.
If the bit is set, OS/2 allows access to all files on an HPFS volume, because
it assumes that the program knows how to handle long free-form filenames and
will take the responsibility of preserving a file's EAs and ACLs.
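The break with 8.3 habits described above can be sketched in a few lines (illustrative Python, not the OS/2 C API; the helper names and the 254-character component limit are assumptions for the sketch). Case-folding, single-dot validation logic from the MS-DOS era rejects names that are perfectly legal on an HPFS volume:

```python
# Illustrative sketch: why FAT-era filename logic must be rethought for HPFS.
# An MS-DOS-style check assumes an 8-character name plus an optional
# 3-character extension and at most one dot; HPFS allows long, free-form,
# mixed-case names with multiple dots and embedded spaces.

def is_valid_fat_name(name):
    """The old 8.3 rule: NAME (<= 8 chars) plus optional .EXT (<= 3 chars)."""
    parts = name.split(".")
    if len(parts) > 2:            # more than one dot is illegal on FAT
        return False
    base = parts[0]
    ext = parts[1] if len(parts) == 2 else ""
    return 0 < len(base) <= 8 and len(ext) <= 3

def is_valid_hpfs_name(name, max_component=254):
    """HPFS sketch: long free-form names; dots and case are not restricted."""
    return 0 < len(name) <= max_component

assert is_valid_fat_name("README.TXT")
assert not is_valid_fat_name("My Budget.1989.notes")   # rejected by 8.3 logic
assert is_valid_hpfs_name("My Budget.1989.notes")      # fine on HPFS
```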
<p>
<hr>
&lt; <a href="faultol.html">[Fault Tolerance]</a> |
<a href="hpfs.html">[HPFS Home]</a> |
<a href="sum.html">[Summary]</a> &gt;
<hr>
<font size=-1>
Html'ed by <a href="http://www.seds.org/~spider/">Hartmut Frommert</a>
</font>
</body></html>

View File

@@ -0,0 +1,43 @@
<html><head>
<title>HPFS: Design</title>
</head>
<body>
<center>
<h1>
Design Goals and Implementation of the<br>
New High Performance File System
</h1>
</center>
The High Performance File System (hereafter HPFS), which is making its first
appearance in the OS/2 operating system Version 1.2, had its genesis in the
network division of Microsoft and was designed by Gordon Letwin, the chief
architect of the OS/2 operating system. The HPFS has been designed to meet
the demands of increasingly powerful PCs, fixed disks, and networks for many
years to come and to serve as a suitable platform for object-oriented languages,
applications, and user interfaces.
<p>
The HPFS is a complex topic because it incorporates three distinct yet
interrelated file system issues. First, the HPFS is a way of organizing data
on a random access block storage device. Second, it is a software module that
translates file-oriented requests from an application program into more
primitive requests that a device driver can understand, using a variety of
creative techniques to maximize performance. Third, the HPFS is a practical
illustration of an important new OS/2 feature known as Installable File Systems.
This article introduces the three aspects of the HPFS. But first, it puts the
HPFS in perspective by reviewing some of the problems that led to the system's
existence.
<p>
<hr>
<a href="hpfs.html">[HPFS Home]</a> |
<a href="fat.html">[FAT File System]</a> &gt;
<hr>
<font size=-1>
Html'ed by <a href="http://www.seds.org/~spider/">Hartmut Frommert</a>
</font>
</body></html>

View File

@@ -0,0 +1,91 @@
<html><head>
<title>HPFS: Directories</title>
</head>
<body>
<center>
<h1>Directories</h1>
</center>
Directories, like files, are anchored on Fnodes. A pointer to the Fnode for
the root directory is found in the Super Block. The Fnodes for directories
other than the root are reached through subdirectory entries in their parent
directories. Directories can grow to any size and are built up from 2Kb
directory blocks, which are allocated as four consecutive sectors on the disk.
The file system attempts to allocate directory blocks in the directory band,
which is located at or near the seek center of the disk. Once the directory
band is full, the directory blocks are allocated wherever space is available.
Each 2Kb directory block contains from one to many directory entries.
A directory entry contains several fields, including time and date stamps,
an Fnode pointer, a usage count for use by disk maintenance programs, the
length of the file or directory name, the name itself, and a B-Tree pointer.
Each entry begins with a word that contains the length of the entry. This
provides for a variable amount of flex space at the end of each entry, which
can be used by special versions of the file system and allows the directory
block to be traversed very quickly (<a href="#fig5">Figure 5</a>).
The number of entries in a directory block varies with the length of names.
If the average filename length is 13 characters, an average directory block
will hold about 40 entries.
<p>
The entries in a directory block are sorted by the binary lexical order of
their name fields (this happens to put them in alphabetical order for the U.S.
alphabet). The last entry in a directory block is a dummy record that marks
the end of the block. When a directory gets too large to be stored in one
block, it increases in size by the addition of 2Kb blocks that are organized
as a B-Tree (see "B-Trees and B+ Trees"). When searching for a specific name,
the file system traverses a directory block until it either finds a match or
finds a name that is lexically greater than the target. In the latter case,
the file system extracts the B-Tree pointer from the entry. If there is no
pointer, the search has failed; otherwise the file system follows the pointer to
the next directory block in the tree and continues the search. A little
back-of-the-envelope arithmetic yields some impressive statistics. Assuming
40 entries per block, a two-level tree of directory blocks can hold 1640
directory entries and a three-level tree can hold an astonishing 65,640 entries.
In other words, a particular file can be found (or shown not to exist) in a
typical directory of 65,640 files with a maximum of three disk hits--the
actual number of disk accesses depending on cache contents and the location
of the file's name in the directory block B-Tree. That's quite a contrast to
the FAT file system, where in the worst case more than 4,000 sectors would
have to be read to establish that a file was or was not present in a directory
containing the same number of files. The B-Tree directory structure has
interesting implications beyond its effect on open and find operations.
A file creation, renaming, or deletion may result in a cascade of complex
operations, as directory blocks are added or freed or names are moved from
one block to the other to keep the tree balanced. In fact, a rename
operation could theoretically fail for lack of disk space even though the
file itself is not growing. In order to avoid this sort of disaster, the
HPFS maintains a small pool of free blocks that can be drawn from in a
directory emergency; a pointer to this pool of free blocks is stored in the
Spare Block.
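The search procedure above can be modeled in a few lines (a toy sketch in Python; the block structure, the "~end" dummy record, and the disk-hit counter are assumptions for illustration, not the on-disk format). Entries in a block are sorted, and finding a name lexically greater than the target either fails the search or descends one level:

```python
# Toy model of the HPFS directory B-Tree search: sorted entries per block,
# with a downward pointer taken when a lexically greater name is reached.

class DirBlock:
    def __init__(self, entries):
        # entries: sorted list of (name, child_block_or_None)
        self.entries = entries

def find(block, target, hits=0):
    """Return (found, disk_hits); each block visited costs one disk hit."""
    hits += 1
    for name, child in block.entries:
        if name == target:
            return True, hits
        if name > target:                 # lexically past the target
            if child is None:
                return False, hits        # no subtree: name not present
            return find(child, target, hits)
    return False, hits

leaf = DirBlock([("alpha", None), ("beta", None), ("~end", None)])
root = DirBlock([("gamma", leaf), ("~end", None)])   # "~end" plays the dummy end record
assert find(root, "beta") == (True, 2)    # root block plus one leaf: two hits
assert find(root, "delta") == (False, 2)  # absence shown just as quickly
```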
<p>
<center>
<a href="fig5.gif" name="fig5">
<img src="fig5.gif" alt="[Fig. 5]" border=0></a>
</center>
<p>
<b>FIGURE 5</b>:
Here directories are anchored on an Fnode and are built up from 2Kb directory
blocks. The number of entries in a directory block varies because the length
of the entries depends on the filename. When a directory requires more than
one block, the blocks are organized as a B-Tree. This allows a filename to be
located very quickly with a small number of disk accesses even when the
directory grows very large.
<p>
<hr>
&lt; <a href="fnodes.html">[Files and Fnodes]</a> |
<a href="hpfs.html">[HPFS Home]</a> |
<a href="ea.html">[Extended Attributes]</a> &gt;
<hr>
<font size=-1>
Html'ed by <a href="http://www.seds.org/~spider/">Hartmut Frommert</a>
</font>
</body></html>

View File

@@ -0,0 +1,65 @@
<html><head>
<title>HPFS: Extended Attributes</title>
</head>
<body>
<center>
<h1>Extended Attributes</h1>
</center>
File attributes are information about a file that is maintained by the
operating system outside the file's overt storage area. The FAT file
system supports only a few simple attributes (read only, system, hidden,
and archive) that are actually stored as bit flags in the file's directory
entry; these attributes are inspected or modified by special function calls
and are not accessible through the normal file open, read, and write calls.
The HPFS supports the same attributes as the FAT file system for historical
reasons, but it also supports a new form of file-associated, highly
generalized information called Extended Attributes (EAs). Each EA is
conceptually similar to an environment variable, taking the form
(name=value), except that the value portion can be either a null-terminated
(ASCIIZ) string or binary data. In OS/2 1.2, each file or directory can
have a maximum of 64Kb of EAs attached to it. This limit may be lifted in
a later release of OS/2. The storage method for EAs can vary. If the EAs
associated with a given file or directory are small enough, they will be
stored right in the Fnode. If the total size of the EAs is too large, they
are stored outside the Fnode in sector runs, and a B+ Tree of allocation
sectors can be created to describe the runs. If a single EA gets too large,
it can be pushed outside the Fnode into a B+ Tree of its own.
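The storage decision just described can be sketched schematically (illustrative Python; the 64Kb cap is from the text, but the Fnode free-space figure and function name are assumptions, not the real on-disk layout):

```python
# Schematic sketch of the EA storage decision: small EA sets are cached in
# the Fnode; oversized sets are pushed out to external sector runs; OS/2 1.2
# caps the total EAs attached to a file or directory at 64Kb.

EA_LIMIT = 64 * 1024          # OS/2 1.2 limit on total EAs per file/directory
FNODE_EA_SPACE = 200          # assumed free bytes in the Fnode for cached EAs

def plan_ea_storage(eas):
    """eas: dict of name -> bytes value. Returns where the set would live."""
    total = sum(len(name) + len(value) for name, value in eas.items())
    if total > EA_LIMIT:
        raise ValueError("EA set exceeds the 64Kb limit")
    return "fnode" if total <= FNODE_EA_SPACE else "external sector runs"

assert plan_ea_storage({".TYPE": b"text/plain"}) == "fnode"
assert plan_ea_storage({".ICON": b"\x00" * 4000}) == "external sector runs"
```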
<p>
The kernel API functions DosQFileInfo and DosSetFileInfo have been expanded
with new information levels that allow application programs to manipulate
extended attributes for files. The new functions DosQPathInfo and
DosSetPathInfo are used to read or write the EAs associated with arbitrary
path names. An application program can either ask for the value of a
specific EA (supplying a name to be matched) or can obtain all of the EAs
for the file or directory at once. Although application programs can begin
to take advantage of EAs as soon as the HPFS is released, support for EAs
is an essential component in Microsoft's long-range plans for object-oriented
file systems. Information of almost any type can be stored in EAs, ranging
from the name of the application that owns the file to names of dependent
files to icons to executable code. As the HPFS evolves, its facilities for
manipulating EAs are likely to become much more sophisticated. It's easy to
imagine, for example, that in future versions the API might be extended with
EA functions that are analogous to DosFindFirst and DosFindNext and EA data
might get organized into B-Trees. I should note here that in addition to EAs,
the LAN Manager version of the HPFS will support another class of file-associated
information called Access Control Lists (ACLs). ACLs have the same general
appearance as EAs and are manipulated in a similar manner, but they are used
to store access rights, passwords, and other information of interest in a
networked multiuser environment.
<p>
<hr>
&lt; <a href="dirs.html">[Directories]</a> |
<a href="hpfs.html">[HPFS Home]</a> |
<a href="ifs.html">[Installable File Systems]</a> &gt;
<hr>
<font size=-1>
Html'ed by <a href="http://www.seds.org/~spider/">Hartmut Frommert</a>
</font>
</body></html>

View File

@@ -0,0 +1,94 @@
<html><head>
<title>HPFS: FAT File System</title>
</head>
<body>
<center>
<h1>FAT File System</h1>
</center>
The so-called FAT file system, which is the file system used in all versions of the
MS-DOS operating system to date and in the first two releases of OS/2 (Versions 1.0
and 1.1), has a dual heritage in Microsoft's earliest programming language products
and the Digital Research CP/M operating system--software originally written for
8080-based and Z-80-based microcomputers. It inherited characteristics from both
ancestors that have progressively turned into handicaps in this new era of
multitasking, protected mode, virtual memory, and huge fixed disks.
<p>
The FAT file system revolves around the File Allocation Table for which it is
named. Each logical volume has its own FAT, which serves two important functions:
it contains the allocation information for each file on the volume in the form
of linked lists of allocation units (clusters, which are power-of-2 multiples
of sectors) and it indicates which allocation units are free for assignment to
a file that is being created or extended.
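Both functions of the table can be modeled directly (a minimal Python sketch; the cluster numbers and sentinel values are illustrative, not a faithful 12-bit or 16-bit FAT encoding):

```python
# Minimal model of the File Allocation Table: each entry holds the number of
# the file's next cluster, forming a linked list, and the same table doubles
# as the free-space map.

FREE, EOC = 0, -1      # sentinel values: free cluster, end-of-chain

def cluster_chain(fat, start):
    """Follow a file's linked list of clusters from its starting cluster."""
    chain, cluster = [], start
    while cluster != EOC:
        chain.append(cluster)
        cluster = fat[cluster]
    return chain

def free_clusters(fat):
    """Clusters available for a file being created or extended."""
    return [i for i, entry in enumerate(fat) if entry == FREE and i >= 2]

#        0     1    2    3    4    5    6
fat = [None, None,  4,  EOC,  6, FREE,  3]     # clusters 0-1 are reserved
assert cluster_chain(fat, 2) == [2, 4, 6, 3]   # the file's four clusters, in order
assert free_clusters(fat) == [5]
```

Note how the chain wanders (2, 4, 6, 3): nothing in the structure encourages contiguous allocation, which is the fragmentation problem discussed below.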
<p>
The FAT was invented by Bill Gates and Marc McDonald in 1977 as
a method of managing disk space in the NCR version of Microsoft's standalone Disk
BASIC. Tim Paterson, at that time an employee of Seattle Computer Products (SCP), was
introduced to the FAT concept when his company shared a booth with Microsoft at the
National Computer Conference in 1979. Paterson subsequently incorporated FATs into
the file system of 86-DOS, an operating system for SCP's S-100 bus 8086 CPU boards.
86-DOS was eventually purchased by Microsoft and became the starting point for
MS-DOS Version 1.0, which was released for the original IBM PC in August 1981.
<p>
When the FAT was conceived, it was an excellent solution to disk management,
mainly because the floppy disks on which it was used were rarely larger than
1 Mb.
On such disks, the FAT was small enough to be held in memory at all times,
allowing very fast random access to any part of any file. This proved far
superior to the CP/M method of tracking disk space, in which the information
about the sectors assigned to a file might be spread across many directory
entries, which were in turn scattered randomly throughout the disk directory.
When applied to fixed disks, however, the FAT began to look more like a bug
than a feature. It became too large to be held entirely resident and had to
be paged into memory in pieces; this paging resulted in many superfluous disk
head movements as a program was reading through a file and degraded system
throughput. In addition, because the information about free disk space was
dispersed across many sectors of the FAT, it was impractical to allocate file
space contiguously, and file fragmentation became another obstacle to good
performance. Moreover, the use of relatively large clusters on fixed disks
resulted in a lot of dead space, since an average of one-half cluster was
wasted for each file. (Some network servers use clusters as large as 64Kb.)
<p>
The FAT file system's restrictions on naming files and directories are
inherited from CP/M. When Paterson was writing 86-DOS, one of his primary
objectives was to make programs easy to port from CP/M to his new operating
system. He therefore adopted CP/M's limits on filenames and extensions so the
critical fields of 86-DOS File Control Blocks (FCBs) would look almost exactly
like those of CP/M. The sizes of the FCB filename and extension fields were
also propagated into the structure of disk directory entries. In due time 86-DOS
became MS-DOS, and application programs for MS-DOS proliferated beyond anyone's
wildest dreams. Since most of the early programs depended on the structure of
FCBs, the 8.3 format for filenames became irrevocably locked into the system.
<p>
During the last couple of years Microsoft and IBM have made valiant attempts
to prolong the useful life of the FAT file system by lifting the restrictions
on volume sizes, improving allocation strategies, caching path names, and moving
tables and buffers into expanded memory. But these can only be regarded as
temporizing measures, because the fundamental data structures used by the FAT
file system are simply not well suited to large random access devices.
The HPFS solves the FAT file system problems mentioned here and many others,
but it is not derived in any way from the FAT file system. The architect of
the HPFS started with a clean sheet of paper and designed a file system that
can take full advantage of a multitasking environment and that will be able to
cope with any sort of disk device likely to arrive on microcomputers during
the next decade.
<p>
<hr>
&lt; <a href="design.html">[HPFS Design]</a> |
<a href="hpfs.html">[HPFS Home]</a> |
<a href="hpfs_vol.html">[HPFS Volumes]</a> &gt;
<hr>
<font size=-1>
Html'ed by <a href="http://www.seds.org/~spider/">Hartmut Frommert</a>
</font>
</body></html>

View File

@@ -0,0 +1,84 @@
<html><head>
<title>HPFS: Fault Tolerance</title>
</head>
<body>
<center>
<h1>Fault Tolerance</h1>
</center>
The HPFS's extensive use of lazy writes makes it imperative for the HPFS to
be able to recover gracefully from write errors under any but the most dire
circumstances. After all, by the time a write is known to have failed, the
application has long since gone on its way under the illusion that it has
safely shipped the data into disk storage. The errors may be detected by the
hardware (such as a "sector not found" error returned by the disk adapter),
or they may be detected by the disk driver in spite of the hardware during a
read-after-write verification of the data. The primary mechanism for handling
write errors is called a hot fix. When an error is detected, the file system
takes a free block out of a reserved hot fix pool, writes the data to that
block, and updates the hot fix map. (The hot fix map is simply a series of
pairs of double words, with each pair containing the number of a bad sector
associated with the number of its hot fix replacement. A pointer to the hot
fix map is maintained in the Spare Block.) A copy of the hot fix map is then
written to disk, and a warning message is displayed to let the user know that
all is not well with the disk device.
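The look-aside translation through the hot fix map is simple enough to sketch (illustrative Python; the sector numbers are invented, and the real map lives as pairs of doublewords in the Spare Block, not a Python list):

```python
# Sketch of hot-fix translation: the map is a series of
# (bad sector, replacement sector) pairs, scanned on each physical read or
# write so that a request for a known-bad sector is redirected to the block
# that actually holds the data.

def apply_hot_fix(hot_fix_map, sector):
    """Translate a requested sector number through the hot fix map."""
    for bad, replacement in hot_fix_map:
        if sector == bad:
            return replacement
    return sector

hot_fix_map = [(1041, 57), (2088, 58)]          # two sectors remapped to the pool
assert apply_hot_fix(hot_fix_map, 1041) == 57   # bad sector redirected
assert apply_hot_fix(hot_fix_map, 500) == 500   # healthy sectors pass through
```

As the text notes, this scan costs little because it happens only on physical I/O, not on every cache access, and CHKDSK later drains the map back to empty.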
<p>
Each time the file system requests a sector read or write from the disk
driver, it scans the hot fix map and replaces any bad sector members with the
corresponding good sector holding the actual data. This look-aside translation
of sector numbers is not as expensive as it sounds, since the hot fix list
need only be scanned when a sector is physically read or written, not each
time it is accessed in the cache. One of CHKDSK's duties is to empty the hot
fix map. For each replacement block on the hot fix map, it allocates a new
sector that is in a favorable location for the file that owns the data, moves
the data from the hot fix block to tile newly allocated sector, and updates
the file's allocation information which may involve rebalancing allocation
trees and other elaborate operations). It then adds the bad sector to the bad
block list, releases the replacement sector back to the hot fix pool, deletes
the hot fix entry from the hot fix map, and writes the updated hot fix map to
disk. of course, write errors that can be detected and fixed on the fly are
not the only calamity that can befall a file system. The HPFS designers also
had to consider the inevitable damage to be wreaked by power failures, program
crashes, malicious viruses and Trojan horses, and those users who turn off
the machine without selecting Shut-down in the Presentation Manager Shell.
(Shutdown notifies the file system to flush the disk cache, update directories,
and do whatever else is necessary to bring the disk to a consistent state.)
<p>
The HPFS defends itself against the user who is too abrupt with the Big Red
Switch by maintaining a Dirty FS flag in the Spare Block of each HPFS volume.
The flag is only cleared when all files on the volume have been closed and
all dirty buffers in the cache have been written out or, in the case of the
boot volume (since OS2.INI and the swap file are never closed), when Shutdown
has been selected and has completed its work. During the OS/2 boot sequence,
the file system inspects the Dirty FS flag on each HPFS volume and, if the
flag is set, will not allow further access to that volume until CHKDSK has
been run. If the Dirty FS flag is set on the boot volume, the system will
refuse to boot; the user must boot OS/2 in maintenance mode from a diskette
and run CHKDSK to check and possibly repair the boot volume. In the event
of a truly major catastrophe, such as loss of the Super Block or the root
directory, the HPFS is designed to give data recovery the best possible
chance of success. Every type of crucial file object, including Fnodes,
allocation sectors, and directory blocks, is doubly linked to both its parent
and its children and contains a unique 32-bit signature. Fnodes also contain
the initial portion of the name of their file or directory. Consequently,
CHKDSK can rebuild an entire volume by methodically scanning the disk for
Fnodes, allocation sectors, and directory blocks, using them to reconstruct
the files and directories and finally regenerating the freespace bitmaps.
<p>
<hr>
&lt; <a href="perform.html">[Performance Issues]</a> |
<a href="hpfs.html">[HPFS Home]</a> |
<a href="app_hpfs.html">[Application Programs and the HPFS]</a> &gt;
<hr>
<font size=-1>
Html'ed by <a href="http://www.seds.org/~spider/">Hartmut Frommert</a>
</font>
</body></html>

Binary file not shown.

After

Width:  |  Height:  |  Size: 7.0 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 2.5 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 8.3 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 5.1 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 9.7 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 4.5 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 2.2 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 2.5 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 3.4 KiB

View File

@@ -0,0 +1,156 @@
<html><head>
<title>HPFS: Illustrations</title>
</head>
<body>
<center>
<h1>HPFS: Illustrations</h1>
</center>
<center>
<a href="fig1.gif" name="fig1">
<img src="fig1.gif" alt="[Fig. 1]" border=0></a>
</center>
<p>
<b>FIGURE 1</b>:
This figure shows the overall structure of an HPFS volume. The most important
fixed objects in such a volume are the Bootblock, the Super Block, and the
Spare Block. The remainder of the volume is divided into 8Mb bands. There is
a freespace bitmap for each band, and the bitmaps are located between alternate
bands; consequently, the maximum contiguous space which can be allocated to a
file is 16Mb.
<li><a href="hpfs_vol.html">HPFS Volume Structure</a>
<p>
<center>
<a href="fig2.gif" name="fig2">
<img src="fig2.gif" alt="[Fig. 2]" border=0></a>
</center>
<p>
<b>FIGURE 2</b>:
This figure shows the overall structure of an Fnode. The Fnode is the
fundamental object in an HPFS volume and is the first sector allocated to a
file or directory. It contains control and access history information used
by the file system, cached EAs and ACLs or pointers to same, a truncated
copy of the file or directory name (to aid disk repair programs), and an
allocation structure which defines the size and location of the file's storage.
<li><a href="fnodes.html">Files and FNodes</a>
<p>
<center>
<a href="fig3.gif" name="fig3">
<img src="fig3.gif" alt="[Fig. 3]" border=0></a>
</center>
<p>
<b>FIGURE 3</b>:
The simplest form of tracking for the sectors owned by a file is shown. The
Fnode's allocation structure points directly to as many as eight sector runs.
Each run pointer consists of a pair of 32-bit doublewords: a starting sector
number and a length in sectors.
<li><a href="fnodes.html">Files and FNodes</a>
<p>
<center>
<a href="fig4.gif" name="fig4">
<img src="fig4.gif" alt="[Fig. 4]" border=0></a>
</center>
<p>
<b>FIGURE 4</b>:
This figure demonstrates the technique used to track the sectors owned by a
file with 9-480 sector runs. The allocation structure in the Fnode holds the
roots for a B+ Tree of allocation sectors. Each allocation sector can describe
as many as 40 sector runs. If the file requires more than 480 sector runs,
additional intermediate levels are added to the B+ Tree, which increases the
number of possible sector runs by a factor of sixty for each new level.
<li><a href="fnodes.html">Files and FNodes</a>
<p>
<center>
<a href="fig5.gif" name="fig5">
<img src="fig5.gif" alt="[Fig. 5]" border=0></a>
</center>
<p>
<b>FIGURE 5</b>:
Here directories are anchored on an Fnode and are built up from 2Kb directory
blocks. The number of entries in a directory block varies because the length
of the entries depends on the filename. When a directory requires more than
one block, the blocks are organized as a B-Tree. This allows a filename to be
located very quickly with a small number of disk accesses even when the
directory grows very large.
<li><a href="dirs.html">Directories</a>
<p>
<center>
<a href="fig6.gif" name="fig6">
<img src="fig6.gif" alt="[Fig. 6]" border=0></a>
</center>
<p>
<b>FIGURE 6</b>:
A simplified sketch of the relationship between an application program, the
OS/2 kernel, an installable file system, a disk driver, and the physical disk
device. The application issues logical file requests to the OS/2 kernel by
calling the entry points for DosOpen, DosRead, DosWrite, DosChgFilePtr, and
so on. The kernel passes these requests to the appropriate installable file
system for the volume holding the file. The installable file system translates
the logical file requests into requests for reads or writes of logical sectors
and calls a kernel File System Helper (FsHlp) to pass these requests to the
appropriate disk driver. The disk driver transforms the logical sector
requests into requests for specific physical units, cylinders, heads, and
sectors, and issues commands to the disk adapter to transfer data between the
disk and memory.
<li><a href="ifs.html">Installable File Systems</a>
<p>
<center>
<a href="figa.gif" name="figa">
<img src="figa.gif" alt="[Fig. A]" border=0></a>
</center>
<p>
<b>FIGURE A</b>:
To find a piece of data, the binary tree is traversed from the root until
the data is found or an empty subtree is encountered.
<li><a href="sum.html">Summary</a>
<p>
<center>
<a href="figb.gif" name="figb">
<img src="figb.gif" alt="[Fig. B]" border=0></a>
</center>
<p>
<b>FIGURE B</b>:
In a balanced B-Tree, data is stored in nodes, more than one data item can
be stored in a node, and all branches of the tree are the same length.
<li><a href="sum.html">Summary</a>
<p>
<center>
<a href="figc.gif" name="figc">
<img src="figc.gif" alt="[Fig. C]" border=0></a>
</center>
<p>
<b>FIGURE C</b>:
A B+ Tree has internal nodes that point to other nodes and external nodes
that contain actual data.
<li><a href="sum.html">Summary</a>
<p>
<hr>
<a href="hpfs.html">[HPFS Home]</a>
<hr>
<font size=-1>
Html'ed by <a href="http://www.seds.org/~spider/">Hartmut Frommert</a>
</font>
</body></html>

View File

@@ -0,0 +1,122 @@
<html><head>
<title>HPFS: Files and FNodes</title>
</head>
<body>
<center>
<h1>Files and FNodes</h1>
</center>
Every file or directory on an HPFS volume is anchored on a fundamental file
system object called an Fnode (pronounced "eff node"). Each Fnode occupies a
single sector and contains control and access history information used
internally by the file system, extended attributes and access control lists
(more about this later), the length and the first 15 characters of the name
of the associated file or directory, and an allocation structure
(<a href="#fig2">Figure 2</a>).
<p>
An Fnode is always stored near the file or directory that it represents.
The allocation structure in the Fnode can take several forms depending on the
size and degree of contiguity of the file or directory.
The HPFS views a file as a collection of one or more runs or extents of one
or more contiguous sectors. Each run is symbolized by a pair of
double-words--a 32-bit starting sector number and a 32-bit length in sectors
(this is referred to as run length encoding).
From an application program's point of view the extents are invisible; the
file appears as a seamless stream of bytes. The space reserved for allocation
information in an Fnode can hold pointers to as many as eight runs of sectors
of up to 16Mb each. (This maximum run size is a result of the band size and
freespace bitmap placement only; it is not an inherent limitation of the file
system.)
Reasonably small files or highly contiguous files can therefore be described
completely within the Fnode (<a href="#fig3">Figure 3</a>).
<p>
The HPFS uses a new method to represent the location of files that are too large
or too fragmented for the Fnode and consist of more than eight runs.
The Fnode's allocation structure becomes the root for a B+ Tree of allocation
sectors which in turn contain the actual pointers to the file's sector runs
(see <a href="#fig4">Figure 4</a> and the sidebar, "B-Trees and B+ Trees").
The Fnode's root has room for 12 elements. Each allocation sector can contain,
in addition to various control information, as many as 40 pointers to sector
runs. Therefore a two-level allocation B+ Tree can describe a file of 480
(12x40) sector runs with a theoretical maximum size of 7.68Gb (12x40x16Mb)
in the current implementation (although the 32-bit signed offset parameter
for DosChgFilePtr effectively limits the sizes to 2Gb).
In the unlikely event that a two-level B+ Tree is not sufficient to describe
a highly fragmented file, the file system will introduce additional levels
in the tree as needed. Allocation sectors in the intermediate levels can hold
as many as 60 internal (nonterminal) B+ Tree nodes, which means that the
descriptive ability of this structure rapidly grows to numbers that are nearly
beyond comprehension. For example, a three-level allocation B+ Tree can describe
a file with as many as 28,800 (12x60x40) sector runs. Run-length encoding and
B+ Trees of allocation sectors are a memory-efficient way to specify a file's
size and location, but they have other significant advantages.
Translating a logical file offset into a sector number is extremely fast:
the file system just needs to traverse the list (or B+ Tree of lists)
of run pointers until it finds the correct range. It can then identify the
sector within the run with a simple calculation.
Run-length encoding also makes it trivial to extend the file logically. If
the newly assigned sector is contiguous with the file's previous last sector,
the file system merely needs to increment the size doubleword of the file's
last run pointer and clear the sector's bit in the appropriate freespace bitmap.
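The offset translation and extension operations just described can be sketched in a few lines (this is an illustrative model, not HPFS code; runs are simple (start, length) pairs):

```python
# Minimal sketch of run-length encoded allocation: translating a logical file
# offset to a sector number is a walk over (start_sector, length) pairs.
SECTOR_SIZE = 512

def offset_to_sector(runs, offset):
    """runs: list of (start_sector, length_in_sectors) pairs."""
    sector_index = offset // SECTOR_SIZE
    for start, length in runs:
        if sector_index < length:
            return start + sector_index   # simple calculation within the run
        sector_index -= length            # skip this run and keep walking
    raise ValueError("offset beyond end of file")

def extend_contiguous(runs, new_sector):
    """Growing into a contiguous sector just bumps the last run's length."""
    start, length = runs[-1]
    assert new_sector == start + length   # contiguous with previous last sector
    runs[-1] = (start, length + 1)

runs = [(100, 8), (500, 4)]               # sectors 100-107, then 500-503
print(offset_to_sector(runs, 0))              # first sector of the first run
print(offset_to_sector(runs, 9 * SECTOR_SIZE))  # second sector of the second run
```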
<p>
<center>
<a href="fig2.gif" name="fig2">
<img src="fig2.gif" alt="[Fig. 2]" border=0></a>
</center>
<p>
<b>FIGURE 2</b>:
This figure shows the overall structure of an Fnode. The Fnode is the
fundamental object in an HPFS volume and is the first sector allocated to a
file or directory. It contains control and access history information used
by the file system, cached EAs and ACLs or pointers to same, a truncated
copy of the file or directory name (to aid disk repair programs), and an
allocation structure which defines the size and location of the file's storage.
<p>
<center>
<a href="fig3.gif" name="fig3">
<img src="fig3.gif" alt="[Fig. 3]" border=0></a>
</center>
<p>
<b>FIGURE 3</b>:
The simplest form of tracking for the sectors owned by a file is shown. The
Fnode's allocation structure points directly to as many as eight sector runs.
Each run pointer consists of a pair of 32-bit doublewords: a starting sector
number and a length in sectors.
<p>
<center>
<a href="fig4.gif" name="fig4">
<img src="fig4.gif" alt="[Fig. 4]" border=0></a>
</center>
<p>
<b>FIGURE 4</b>:
This figure demonstrates the technique used to track the sectors owned by a
file with 9-480 sector runs. The allocation structure in the Fnode holds the
root for a B+ Tree of allocation sectors. Each allocation sector can describe
as many as 40 sector runs. If the file requires more than 480 sector runs,
additional intermediate levels are added to the B+ Tree, which increases the
number of possible sector runs by a factor of sixty for each new level.
<p>
<hr>
&lt; <a href="hpfs_vol.html">[HPFS Volume Structure]</a> |
<a href="hpfs.html">[HPFS Home]</a> |
<a href="dirs.html">[Directories]</a> &gt;
<hr>
<font size=-1>
Html'ed by <a href="http://www.seds.org/~spider/">Hartmut Frommert</a>
</font>
</body></html>

<html><head>
<title>HPFS: HPFS Volume Structure</title>
</head>
<body>
<center>
<h1>HPFS Volume Structure</h1>
</center>
HPFS volumes are a new partition type--type 7--and can exist on a fixed disk
alongside of the several previously defined FAT partition types.
IBM-compatible HPFS volumes use a sector size of 512 bytes and have a maximum
size of 2199Gb (2<sup>32</sup> sectors).
Although there is no particular reason why floppy disks can't be formatted
as HPFS volumes, Microsoft plans to stick with FAT file systems on floppy disks
for the foreseeable future.
(This ensures that users will be able to transport files easily between MS-DOS
and OS/2 systems.)
An HPFS volume has very few fixed structures (<a href="#fig1">Figure 1</a>).
Sectors 0-15 of a volume (8Kb) are the Bootblock and contain a volume name,
32-bit volume ID, and a disk bootstrap program. The bootstrap is relatively
sophisticated (by MS-DOS standards) and can use the HPFS in a restricted
mode to locate and read the operating system files wherever they might be found.
Sectors 16 and 17 are known as the Super Block and the Spare Block, respectively.
The Super Block is only modified by disk maintenance utilities.
It contains pointers to the free space bitmaps, the bad block list, the directory
block band, and the root directory.
It also contains the date that the volume was last checked and repaired
with CHKDSK /F. The Spare Block contains various flags and pointers that
will be discussed later; it is modified, although infrequently, as the system
executes. The remainder of the disk is divided into 8Mb bands.
Each band has its own free space bitmap in which a bit represents each sector.
A bit is 0 if the sector is in use and 1 if the sector is available.
The bitmaps are located at the head or tail of a band so that two bitmaps are
adjacent between alternate bands. This allows the maximum contiguous free space
that can be allocated to a file to be 16Mb. One band located at or toward the
seek center of the disk is called the directory block band and receives
special treatment (more about this later). Note that the band size is a
characteristic of the current implementation and may be changed in later
versions of the file system.
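The bitmap convention just described (one bit per sector, 1 meaning available) makes searching for contiguous free space a simple bit scan. A toy sketch, not HPFS's tuned assembly-language implementation:

```python
# Illustrative freespace-bitmap scan: 1 = sector available, 0 = in use,
# as in the HPFS convention described above.
def find_free_run(bitmap, wanted):
    """Return the index of the first run of `wanted` free sectors, or -1."""
    run_start, run_len = 0, 0
    for i, bit in enumerate(bitmap):
        if bit:
            if run_len == 0:
                run_start = i        # possible start of a new free run
            run_len += 1
            if run_len == wanted:
                return run_start
        else:
            run_len = 0              # run broken by an in-use sector
    return -1

bitmap = [0, 1, 1, 0, 1, 1, 1, 1, 0]
print(find_free_run(bitmap, 3))      # first run of 3 free sectors starts at 4
```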
<p>
<center>
<a href="fig1.gif" name="fig1">
<img src="fig1.gif" alt="[Fig. 1]" border=0></a>
</center>
<p>
<b>FIGURE 1</b>.
This figure shows the overall structure of an HPFS volume.
The most important fixed objects in such a volume are the Bootblock, the Super
Block, and the Spare Block.
The remainder of the volume is divided into 8Mb bands.
There is a freespace bitmap for each band, and the bitmaps are located between
alternate bands; consequently, the maximum contiguous space which can be
allocated to a file is 16Mb.
<p>
<hr>
&lt; <a href="fat.html">[FAT File System]</a> |
<a href="hpfs.html">[HPFS Home]</a> |
<a href="fnodes.html">[Files and Fnodes]</a> &gt;
<hr>
<font size=-1>
Html'ed by <a href="http://www.seds.org/~spider/">Hartmut Frommert</a>
</font>
</body></html>

<html><head>
<title>HPFS: Installable File Systems</title>
</head>
<body>
<center>
<h1>Installable File Systems</h1>
</center>
Support for installable file systems has been one of the most eagerly
anticipated features of OS/2 Version 1.2. It will make it possible to
access multiple incompatible volume structures--FAT, HPFS, CD ROM, and
perhaps even UNIX--on the same OS/2 system at the same time, will
simplify the life of network implementors, and will open the door to
rapid file system evolution and innovation. Installable file systems
are, however, only relevant to the HPFS insofar as they make use of
the HPFS optional. The FAT file system is still embedded in the OS/2
kernel, as it was in OS/2 1.0 and 1.1, and will remain there as the
compatibility file system for some time to come. An installable file
system driver (FSD) is analogous in many ways to a device driver. An
FSD resides on the disk in a file that is structured like a
dynamic-link library (DLL), typically with a SYS or IFS extension,
and is loaded during system initialization by <tt>IFS=</tt> statements
in the <tt>CONFIG.SYS</tt> file. <tt>IFS=</tt> directives are processed
in the order they are encountered and are also sensitive to the order of
<tt>DEVICE=</tt> statements for device drivers. This lets you load a
device driver for a nonstandard device, load a file system driver from
a volume on that device, and so on. Once an FSD is installed and
initialized, the kernel communicates with it in terms of logical requests
for file opens, reads, writes, seeks, closes, and so on.
The FSD translates these requests--using control structures and tables
found on the volume itself--into requests for sector reads and writes for
which it can call special kernel entry points called File System Helpers
(FsHlps). The kernel passes the demands for sector I/O to the appropriate
device driver and returns the results to the FSD
(<a href="#fig6">Figure 6</a>).
The procedure used by the operating system to associate volumes with
FSDs is called dynamic mounting and works as follows. Whenever a volume
is first accessed, or after it has been locked for direct access and then
unlocked (for example, by a FORMAT operation), OS/2 presents identifying
information from the volume to each of the FSDs in turn until one of them
recognizes the information. When an FSD claims the volume, the volume is
mounted and all subsequent file I/O requests for the volume are routed to
that FSD.
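The dynamic-mounting handshake can be pictured schematically as follows. The class names, method names, and signature values here are all illustrative stand-ins, not the real OS/2 interfaces or on-disk constants:

```python
# Schematic sketch of dynamic mounting: the kernel offers identifying volume
# information to each FSD in turn until one of them claims the volume.
class StubFSD:
    def __init__(self, name, signature):
        self.name, self.signature = name, signature

    def recognizes(self, id_info):
        # A real FSD would examine identifying sectors read from the volume.
        return id_info.get("signature") == self.signature

def mount(id_info, fsds, mounted):
    """Offer the volume's identifying info to each installed FSD in turn."""
    for fsd in fsds:
        if fsd.recognizes(id_info):
            mounted[id_info["drive"]] = fsd   # claimed: route all I/O here
            return fsd
    return None                               # no FSD recognized the volume

fsds = [StubFSD("HPFS", 0x1111), StubFSD("CDFS", 0x2222)]
mounted = {}
claimed = mount({"drive": "D:", "signature": 0x2222}, fsds, mounted)
print(claimed.name)   # the FSD that claimed the volume
```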
<p>
<center>
<a href="fig6.gif" name="fig6">
<img src="fig6.gif" alt="[Fig. 6]" border=0></a>
</center>
<p>
<b>FIGURE 6</b>:
A simplified sketch of the relationship between an application program, the
OS/2 kernel, an installable file system, a disk driver, and the physical
disk device. The application issues logical file requests to the OS/2 kernel
by calling the entry points for DosOpen, DosRead, DosWrite, DosChgFilePtr, and
so on. The kernel passes these requests to the appropriate installable file
system for the volume holding the file. The installable file system translates
the logical file requests into requests for reads or writes of logical sectors
and calls a kernel File System Helper (FsHlp) to pass these requests to the
appropriate disk driver. The disk driver transforms the logical sector requests
into requests for specific physical units, cylinders, heads, and sectors, and
issues commands to the disk adapter to transfer data between the disk and
memory.
<p>
<hr>
&lt; <a href="ea.html">[Extended Attributes]</a> |
<a href="hpfs.html">[HPFS Home]</a> |
<a href="perform.html">[Performance Issues]</a> &gt;
<hr>
<font size=-1>
Html'ed by <a href="http://www.seds.org/~spider/">Hartmut Frommert</a>
</font>
</body></html>

<html><head>
<title>HPFS</title>
</head>
<body>
<center>
<h1>HPFS</h1>
<b>High Performance File System</b>
</center>
<p>
Also available in OS/2 IPF format: <a href="hpfs.inf">hpfs.inf</a>.
<ul>
<li><a href="design.html">Design Goals and Implementation of the New
High Performance File System</a>
<li><a href="fat.html">FAT File System</a>
<li><a href="hpfs_vol.html">HPFS Volume Structure</a>
<li><a href="fnodes.html">Files and Fnodes</a>
<li><a href="dirs.html">Directories</a>
<li><a href="ea.html">Extended Attributes</a>
<li><a href="ifs.html">Installable File Systems</a>
<li><a href="perform.html">Performance issues</a>
<li><a href="faultol.html">Fault Tolerance</a>
<li><a href="app_hpfs.html">Application Programs and the HPFS</a>
<li><a href="sum.html">Summary</a>
<p>
<li><a href="figs.html">Collection of Illustrations</a>
</ul>
<hr>
<font size=-1>
Html'ed by <a href="http://www.seds.org/~spider/">Hartmut Frommert</a>
</font>
</body></html>

<html><head>
<title>HPFS: Performance issues</title>
</head>
<body>
<center>
<h1>Performance issues</h1>
</center>
The HPFS attacks potential bottlenecks in disk throughput at multiple levels.
It uses advanced data structures, contiguous sector allocation, intelligent
caching, read-ahead, and deferred writes in order to boost performance.
First, the HPFS matches its data structures to the task at hand:
sophisticated data structures (B-Trees and B+ Trees) for fast random access
to filenames, directory names, and lists of sectors allocated to files or
directories, and simple compact data structures (bitmaps) for locating chunks
of free space of the appropriate size. The routines that manipulate these
data structures are written in assembly language and have been painstakingly
tuned, with special focus on the routines that search the freespace bitmaps
for patterns of set bits (unused sectors). Next, the HPFS's main goal--its
prime directive, if you will--is to assign consecutive sectors to files
whenever possible. The time required to move the disk's read/write head from
one track to another far outweighs the other possible delays, so the HPFS
works hard to avoid or minimize such head movements by allocating file space
contiguously and by keeping control structures such as Fnodes and freespace
bitmaps near the things they control.
<p>
Highly contiguous files also help the file system make fewer requests of the
disk driver for more sectors at a time, allow the disk driver to exploit the
multisector transfer capabilities of the disk controller, and reduce the
number of disk completion interrupts that must be serviced. Of course, trying
to keep files from becoming fragmented in a multitasking system in which many
files are being updated concurrently is no easy chore. One strategy the HPFS
uses is to scatter newly created files across the disk--in separate bands,
if possible--so that the sectors allocated to the files as they are extended
will not be interleaved. Another strategy is to preallocate approximately 4Kb
of contiguous space to the file each time it must be extended and give back
any excess when the file is closed. If an application knows the ultimate size
of a new file in advance, it can assist the file system by specifying an
initial file allocation when it creates the file. The system will then search
all the free space bitmaps to find a run of consecutive sectors large enough
to hold the file. Failing that, it will search for two runs that are half
the size of the file, and so on.
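The halving search just described can be sketched as follows. This is an illustrative model only: `find_free_run` stands in for the bitmap search, and a real allocator would of course mark the returned runs as used:

```python
# Sketch of the halving allocation strategy: try one run of `need` sectors;
# failing that, two runs of half the size, and so on.
def allocate(need, find_free_run):
    """find_free_run(n) -> start sector or None. Returns [(start, length)]."""
    pieces, size = 1, need
    while size >= 1:
        starts = []
        for _ in range(pieces):
            s = find_free_run(size)
            if s is None:
                break                          # no run of this size available
            starts.append((s, size))
        if len(starts) == pieces and pieces * size >= need:
            return starts
        pieces, size = pieces * 2, size // 2   # halve run size, double count
    return None                                # volume too fragmented

# Toy stand-in for the bitmap search: pretend only runs of up to 4 sectors
# exist, always starting at sector 100 (it never marks anything used).
def toy_finder(n):
    return 100 if n <= 4 else None

print(allocate(8, toy_finder))   # falls back to two half-size runs
```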
<p>
The HPFS relies on several different kinds of caching to minimize the number
of physical disk transfers it must request. Naturally, it caches sectors, as
did the FAT file system. But unlike the FAT file system, the HPFS can manage
very large caches efficiently and adjusts sector caching on a per handle basis
to the manner in which a file is used. The HPFS also caches path names and
directories, transforming disk directory entries into an even more compact and
efficient in-memory representation. Another technique that the HPFS uses to
improve performance is to preread data it believes the program is likely to
need. For example, when a file is opened, the file system will pre-read and
cache the Fnode and the first few sectors of the file's contents. If the file
is an executable program or the history information in the file's Fnode shows
that an open operation has typically been followed by an immediate sequential
read of the entire file, the file system will preread and cache much more of
the file's contents. When a program issues relatively small read requests, the
file system always fetches data from the file in 2Kb chunks and caches the
excess, allowing most read operations to be satisfied from the cache. Finally,
the OS/2 operating system's support for multitasking makes it possible for the
HPFS to rely heavily on lazy writes (sometimes called deferred writes or write
behind) to improve performance. When a program requests a disk write, the data
is placed in the cache and the cache buffer is flagged as dirty (that is,
inconsistent with the state of the data on disk). When the disk becomes idle
or the cache becomes saturated with dirty buffers, the file system uses a
captive thread from a daemon process to write the buffers to disk, starting
with the oldest data. In general, lazy writes mean that programs run faster
because their read requests will almost never be stalled waiting for a write
request to complete. For programs that repeatedly read, modify, and write a
small working set of records, it also means that many unnecessary or redundant
physical disk writes may be avoided. Lazy writes have their dangers, of course,
so a program can defeat them on a per-handle basis by setting the write-through
flag in the Open Mode parameter for DosOpen or it can commit data to disk on a
per-handle basis with the DosBufReset function.
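The oldest-first flushing of dirty buffers can be modeled compactly. The names here are illustrative, not OS/2's internal structures:

```python
# Sketch of write-behind: dirty cache buffers are queued and later flushed to
# disk starting with the oldest data, as described above.
import heapq

class LazyCache:
    def __init__(self):
        self.clock = 0
        self.dirty = []                    # min-heap of (age, sector, data)

    def write(self, sector, data):
        self.clock += 1                    # logical timestamp of the write
        heapq.heappush(self.dirty, (self.clock, sector, data))

    def flush(self, disk_write):
        while self.dirty:                  # oldest data goes to disk first
            _, sector, data = heapq.heappop(self.dirty)
            disk_write(sector, data)

cache, log = LazyCache(), []
cache.write(7, b"b")
cache.write(3, b"a")
cache.flush(lambda sector, data: log.append(sector))
print(log)   # flushed in write order (oldest first), not sector order
```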
<p>
<hr>
&lt; <a href="ifs.html">[Installable File Systems]</a> |
<a href="hpfs.html">[HPFS Home]</a> |
<a href="faultol.html">[Fault Tolerance]</a> &gt;
<hr>
<font size=-1>
Html'ed by <a href="http://www.seds.org/~spider/">Hartmut Frommert</a>
</font>
</body></html>

<html><head>
<title>HPFS: Summary</title>
</head>
<body>
<center>
<h1>Summary</h1>
</center>
The HPFS solves all of the historical problems of the FAT file system. It
achieves excellent throughput even in extreme cases--many very small files
or a few very large files--by means of advanced data structures and techniques
such as intelligent caching, read-ahead, and write-behind. Disk space is used
economically because it is managed on a sector basis. Existing application
programs will need modification to take advantage of the HPFS's support for
extended attributes and long filenames, but these changes will not be difficult.
All application programs will benefit from the HPFS's improved performance and
decreased CPU use, whether they are modified or not. This article is based on a
prerelease version of the HPFS that was still undergoing modification and
tuning. Therefore, the final release of the HPFS may differ in some details
from the description given here.
<p>
Most programmers are at least passingly familiar with the data structure
known as a binary tree. Binary trees are a technique for imposing a logical
ordering on a collection of data items by means of pointers, without regard
to the physical order of the data. In a simple binary tree, each node contains
some data, including a key value that determines the node's logical position
in the tree, as well as pointers to the node's left and right subtrees. The
node that begins the tree is known as the root; the nodes that sit at the
ends of the tree's branches are sometimes called the leaves. To find a
particular piece of data, the binary tree is traversed from the root. At each
node, the desired key is compared with the node's key; if they don't match,
one branch of the node's subtree or another is selected based on whether the
desired key is less than or greater than the node's key. This process
continues until a match is found or an empty subtree is encountered
(see <a href="#figa">Figure A</a>).
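The traversal just described takes only a few lines in any language; a minimal sketch:

```python
# Binary tree search: walk from the root, comparing the desired key with each
# node's key and descending left or right until a match or an empty subtree.
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def search(node, key):
    while node is not None:
        if key == node.key:
            return node                # match found
        node = node.left if key < node.key else node.right
    return None                        # fell off an empty subtree

#        8
#       / \
#      3   10
root = Node(8, Node(3), Node(10))
print(search(root, 10) is not None)    # key present
print(search(root, 5) is not None)     # key absent
```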
Such simple binary trees, although easy to understand and implement, have
disadvantages in practice. If keys are not well distributed or are added to
the tree in a non-random fashion, the tree can become quite asymmetric,
leading to wide variations in tree traversal times. In order to make access
times uniform, many programmers prefer a particular type of balanced tree
known as a B-Tree. For the purposes of this discussion, the important
points about a B-Tree are that data is stored in all nodes, more than one
data item might be stored in a node, and all of the branches of the tree
are of identical length (see <a href="#figb">Figure B</a>).
The worst-case behavior of a B-Tree is predictable and much better than that
of a simple binary tree, but the maintenance of a B-Tree is correspondingly
more complex. Adding a new data item, changing a key value, or deleting a
data item may result in the splitting or merging of a node, which in turn
forces a cascade of other operations on the tree to rebalance it. A B+ Tree
is a specialized form of B-Tree that has two types of nodes: internal, which
only point to other nodes, and external, which contain the actual data
(see <a href="#figc">Figure C</a>).
The advantage of a B+ Tree over a B-Tree is that the internal nodes of the
B+ Tree can hold many more decision values than the intermediate-level nodes
of a B-Tree, so the fan-out of the tree is faster and the average length of
a branch is shorter. This makes up for the fact that you must always follow
a B+ Tree branch to its end to get the data for which you are looking, whereas
in a B-Tree you may discover the data at an intermediate node or even at the
root.
<table><tr>
<th> </th><th align=left>FAT File System </th><th align=left>High Performance File System</th>
</tr><tr>
<td>Maximum filename length </td><td>11 (in 8.3 format) </td><td>254</td>
</tr><tr>
<td>Number of dot (.) delimiters allowed </td><td>One </td><td>Multiple</td>
</tr><tr>
<td>File Attributes </td><td>Bit flags </td><td>Bit flags plus up to 64Kb of free-form ASCII or binary information</td>
</tr><tr>
<td>Maximum Path Length </td><td>64 </td><td>260</td>
</tr><tr>
<td>Minimum disk space overhead per file </td><td>Directory entry (32 bytes) </td><td>Directory entry (length varies) + Fnode (512 bytes)</td>
</tr><tr>
<td>Average wasted space per file </td><td>1/2 cluster (typically 2Kb or more) </td><td>1/2 sector (256 bytes)</td>
</tr><tr>
<td>Minimum allocation unit </td><td>Cluster (typically 4Kb or more) </td><td>Sector (512 bytes)</td>
</tr><tr>
<td>Allocation info for files </td><td>Centralized in FAT on home track </td><td>Located near each file in its Fnode</td>
</tr><tr>
<td>Free disk space info </td><td>Centralized in FAT on home track </td><td>Located near free space in bitmaps</td>
</tr><tr>
<td>Free disk space described per byte </td><td>2Kb (1/2 cluster at 8 sectors/cluster)
</td><td>4Kb (8 sectors)</td>
</tr><tr>
<td>Directory structure </td><td>Unsorted linear list, must be searched exhaustively
</td><td>Sorted B-Tree</td>
</tr><tr>
<td>Directory Location </td><td>Root directory on home track, others scattered
</td><td>Localized near seek center of volume</td>
</tr><tr>
<td>Cache replacement strategy </td><td>Simple LRU
</td><td>Modified LRU, sensitive to data type and usage history</td>
</tr><tr>
<td>Read ahead </td><td>None in MS-DOS 3.3 or earlier, primitive read-ahead optional in MS-DOS 4
</td><td>Always present, sensitive to data type and usage history</td>
</tr><tr>
<td>Write behind </td><td>Not available </td><td>Used by default, but can be defeated on per-handle basis</td>
</tr></table>
<p>
<center>
<a href="figa.gif" name="figa">
<img src="figa.gif" alt="[Fig. A]" border=0></a>
</center>
<p>
<b>FIGURE A</b>:
To find a piece of data, the binary tree is traversed from the root until
the data is found or an empty subtree is encountered.
<p>
<center>
<a href="figb.gif" name="figb">
<img src="figb.gif" alt="[Fig. B]" border=0></a>
</center>
<p>
<b>FIGURE B</b>:
In a balanced B-Tree, data is stored in nodes, more than one data item can
be stored in a node, and all branches of the tree are the same length.
<p>
<center>
<a href="figc.gif" name="figc">
<img src="figc.gif" alt="[Fig. C]" border=0></a>
</center>
<p>
<b>FIGURE C</b>:
A B+ Tree has internal nodes that point to other nodes and external nodes
that contain actual data.
<p>
<hr>
&lt; <a href="app_hpfs.html">[Application Programs and HPFS]</a> |
<a href="hpfs.html">[HPFS Home]</a>
<hr>
<font size=-1>
Html'ed by <a href="http://www.seds.org/~spider/">Hartmut Frommert</a>
</font>
</body></html>

<H1>Inside the High Performance File System</H1>
<H2>Part 0: Preface</H2>
Written by Dan Bridges
<H2>Introduction</H2>
<P>
I am not much of a programmer, but I am an enthusiast interested in
finding out more about HPFS. There is so little detailed information
available on HPFS that I think you will find this modest series
instructive. The REXX programs to be presented are functional, but they
are not particularly pleasing in an aesthetic sense. However, they do
ferret out information and will help you to understand what is going on.
I'm sure that a programming guru, once motivated, could come up with
superior versions. Hopefully they will. This installment originally
appeared at the OS2Zone web site (http://www.os2zone.aus.net).
<P>
I've been asked [by someone else. Ed.] to write a preface to this series.
Normally I prefer to write on little-covered topics whereas much of what I'm
going to discuss in this installment often appears in a cursory examination of
the HPFS. The trouble with most of what has been written about HPFS in books on
OS/2 is that the topic is never considered very deeply. After working
your way through this series (still being written on a monthly basis, but
expected to occupy eight parts including this one) you will have a detailed
knowledge of the structures of the HPFS. Having said that, there is a place for
some initial information for readers who currently know very little about the
subject.
<P>
<H2>File Systems</H2>
<P>
A File System (FS) is a combination of hardware and software that
enables the storage and retrieval of information on removable (floppy
disk, tape, CD) and non-removable (HD) media. The File Allocation Table
FS (FAT) is used by DOS. It is also built into OS/2. Now FAT appeared
back in the days of DOS v1 in 1981 and was designed with a backward
glance to CP/M. A hierarchical directory structure arrived with DOS v2
to support the XT's 10 MB HD. OS/2 v1.x used straight FAT. OS/2 v2.x
and later provide "Super FAT". This uses the same layout of
information on the storage medium (e.g. a floppy written under OS/2 v2
can easily be read by a DOS system) but adds performance improvements to
the software used to transfer the data. Super FAT will be covered in
Part 1.
<P>
<H2>FAT</H2>
<P>
Figure 1 shows the layout of a FAT volume. There are two copies of the
FAT. These should be identical. This may seem like a safety feature,
but it only works in the case of physical corruption (if a bad sector
develops in one of the sectors in a FAT, the other copy is automatically
used instead), not for logical corruption. So if the FS gets confused
and the two copies are not the same, there is no easy way to determine
which copy is still OK.
<P>
<IMG SRC="hpfs1.gif" WIDTH=498 HEIGHT=64>
<P>
<FONT SIZE=2>
Figure 1: The layout of a volume formatted with the FAT file system.
Note: this diagram is not to scale. The data area is quite large in
practice.
</FONT>
<P>
The root directory is made a fixed known size because the system files
are placed immediately after it. The known location for the initial
system files enables DOS or OS/2 to commence loading itself. (The boot
record, which loads first off, is small and only has enough space for
code to find the initial system files at a known location.) However
this design decision also limits the number of files that can be listed
in the root directory of a FAT volume.
<P>
Entries in the root directory and in subdirectories are not ordered so
searching for a particular file can take some time, particularly if
there are many files in a directory.
<P>
The FAT and the root directory are positioned at the beginning of the
volume (on a disk this is typically on the outside). These entries are
read often, particularly in a multitasking environment, requiring a lot
of relatively slow (in CPU terms) head movement.
<P>
<H2>How Files are Stored on a FAT Volume</H2>
<P>
Files are stored on a FAT volume using the FS' minimum allocation unit,
the cluster (1-64 sectors). A 32-byte directory entry only provides
sufficient space for an 8.3 filename, file attributes, last alteration
date/time, filesize and the starting cluster. See Figure 2.
<P>
<IMG SRC="hpfs2.gif" WIDTH=388 HEIGHT=209>
<P>
<FONT SIZE=2>
Figure 2: The layout of the 32 bytes in a directory entry in a FAT
system.
</FONT>
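The 32-byte entry in Figure 2 can be decoded with a fixed-layout unpack. The offsets below follow the classic FAT directory-entry layout (8+3 name, attribute byte, reserved area, time, date, starting cluster, file size); the sample entry is fabricated for illustration:

```python
# Decoding a 32-byte FAT directory entry (classic layout, little-endian).
import struct

LAYOUT = "<8s3sB10sHHHI"   # name, ext, attr, reserved, time, date, start, size

def parse_dirent(raw):
    name, ext, attr, _res, _time, _date, start, size = struct.unpack(LAYOUT, raw)
    return {
        "name": (name.rstrip(b" ") + b"." + ext.rstrip(b" ")).decode("ascii"),
        "attr": attr,
        "start_cluster": start,   # index of the file's first cluster in the FAT
        "size": size,
    }

# A fabricated sample entry: README.TXT, archive bit set, cluster 5, 1234 bytes.
raw = struct.pack(LAYOUT, b"README  ", b"TXT", 0x20, b"\0" * 10, 0, 0, 5, 1234)
print(parse_dirent(raw))
```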
<P>
The corresponding initial cluster entry in the FAT then points to the
next FAT entry for the second cluster of the file (assuming that the
file was big enough) which in turn points to the next cluster and so on.
FAT entries can be 16-bit (max. FFFFh) or 12-bit (max. FFFh) in size,
with volumes less than 16 MB using the 12-bit scheme. FAT entries can
be of four types:
<UL>
<LI>Contain 0000h if the cluster is free (available);
<LI>Contain the number of the next cluster in the chain;
<LI>If this is the last cluster in the chain then the FAT entry will
consist of a character which signifies the end of the chain (EOF);
<LI>Another special character if the cluster of the disk is bad
(unreliable).
</UL>
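Following a cluster chain through the FAT is then a matter of repeated lookups until an EOF marker is reached. The sketch below uses the 16-bit scheme's conventional marker values (0x0000 free, 0xFFF7 bad, 0xFFF8-0xFFFF end of chain):

```python
# Walking a FAT cluster chain: each entry points to the file's next cluster.
FREE, BAD = 0x0000, 0xFFF7          # 16-bit FAT marker values

def cluster_chain(fat, start):
    chain, cluster = [], start
    while True:
        chain.append(cluster)
        nxt = fat[cluster]
        if nxt >= 0xFFF8:           # EOF marker: last cluster of the file
            return chain
        if nxt in (FREE, BAD):      # a broken (corrupted) chain
            raise ValueError("broken chain at cluster %d" % cluster)
        cluster = nxt

# A fragmented file starting at cluster 2, occupying clusters 2 -> 5 -> 3.
fat = {2: 5, 5: 3, 3: 0xFFFF}
print(cluster_chain(fat, 2))        # the file's clusters, in order
```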
<P>
The FAT FS is prone to fragmentation (i.e. a file's clusters are not in
one contiguous chain) even in a single-tasking environment, because the FAT
is searched sequentially for the next free entry when a file
is written, regardless of how much needs to be written. The situation
is even worse in a multitasking environment because you can have more
than one writing operation in progress at the same time. See Figures 3
and 4 for an example of a fragmented file under FAT.
<P>
<IMG WIDTH=391 HEIGHT=238 SRC="hpfs3.gif">
<P>
<FONT SIZE=2>
Figure 3: The layout of a contiguous file in the FAT.
</FONT>
<P>
<IMG WIDTH=458 HEIGHT=232 SRC="hpfs4.gif">
<P>
<FONT SIZE=2>
Figure 4: An example of a fragmented file under FAT in three pieces.
</FONT>
<P>
The FAT FS uses a singly-linked scheme, i.e. each FAT entry points only
to the next cluster. If, for some reason, the chain is accidentally
broken (the next cluster value is corrupted) then there is no
information in the isolated next cluster to indicate what it was
previously connected to. So the FAT FS, while relatively simple, is
also rather vulnerable.
<P>
FAT was designed in the days of small disk size and today it really
shows its age. The maximum number of entries (clusters) in a 16-bit FAT
is just under 64K (due to technical reasons, the actual maximum is
65,518). Since we can't increase the number of clusters past this
limit, a large volume requires the use of large cluster sizes. So, for
example, a volume in the 1-2 GB range has 32 KB clusters. Now a cluster
is the minimum allocation unit so a 1 byte file on such a volume would
consume 32 KB of space, a 33 KB file would consume 64 KB and so on. A
rough assumption you can make is that, on average, half a cluster of
space is wasted per file. You can run CHKDSK on a FAT volume, note the
total number of files and also the allocation unit size and then
multiply these two figures together and divide the result by 2 to get
some idea of the wastage. The situation is quite different with HPFS as
you will see when you read Part 1.
<P>
Finally, FAT under OS/2 supports Extended Attributes (EAs - up to 64 KB
of extra information associated with a file), but since there is very
little extra space in a 32-byte directory entry it is only possible to
store a pointer into an external file with all EAs on a volume being
stored in this file ("EA DATA. SF"). In general it is fair to state
that EAs are tacked on to FAT. With HPFS the integration is much
better. If the EA is small enough, HPFS stores it completely within the
file's FNODE (every file and directory has an FNODE). Otherwise the EAs are
stored outside the file but closely associated with it, and usually
situated physically close to the file for performance reasons. Some
users have occasionally reported crosslinking of EAs under FAT. This
can be quite a serious matter requiring reinstallation of the operating
system. I've not heard of this occurring under HPFS. Note that the
WorkPlace Shell relies heavily on EAs.
<P>
<H2>HPFS</H2>
<P>
HPFS is an example of a class of file systems known as Installable File
Systems (IFS). Other types of IFS include CD support (CDFS), Network
File System (NFS), Toronto Virtual File System (TVFS - combines FS
elements of VM, namely CMS search path, with elements of UNIX, namely
symbolic link), EXT2-OS (read Linux EXT2FS partitions under OS/2) and
HPFS386 (with IBM LAN Server Advanced).
<P>
An IFS is installed at start-up time. The software to access the actual
device is specified as a device driver (usually BASEDEV=xxxxx.DMD/.ADD)
while a Dynamic Link Library (DLL) is loaded to control the format/layout
of the data (with IFS=xxxxx.IFS). OS/2 can run more than one IFS at a
time so you could, for example, copy from a CD to a HPFS volume in one
session while reading a floppy disk (FAT) in another session.
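Concretely, the ordering described might look like the following CONFIG.SYS fragment. The driver and IFS filenames here are placeholders, not actual product names:

```
REM Load the device driver for the storage device first...
BASEDEV=XYZDISK.ADD
REM ...then the file system driver for volumes on that device.
IFS=C:\OS2\XYZFS.IFS
```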
<P>
HPFS has many advantages over FAT: long filenames (254 characters,
including spaces); excellent performance with directories containing
many files; designed to be fault tolerant; fragmentation resistant;
space efficient with large partitions; works well in a multitasking
environment. These topics will be explored in the series.
<P>
<H2>REXX</H2>
<P>
One of the many benefits of using OS/2 is that it comes with REXX
(providing you install it - it requires very little extra space). REXX
is a surprisingly versatile and powerful scripting language and there
are oodles of REXX programs and add-ons available, much of it for free.
This series presents REXX programs that access HPFS structures and
decode their contents.
<P>
<H2>Conclusion</H2>
<P>
In this installment you have seen that the FAT FS has a number of
problems related to its ancient origins. HPFS comes from a fresh design
with one eye on likely advances in storage that would occur in the
foreseeable future and the other eye on obtaining good performance. In
the next installment we look at the many techniques HPFS uses to achieve
its better performance.
</BODY>
</HTML>

<H1>Inside the High Performance File System</H1>
<H2>Part 1: Introduction</H2>
Written by Dan Bridges
<H2>Introduction</H2>
<P>
This article originally appeared in the February 1996 issue of
Significant Bits, the monthly magazine of the Brisbug PC User Group Inc.
<P>
It is sad to think that most OS/2 users are not using HPFS. The main
reason is that, unless you own the commercial program Partition Magic,
switching to HPFS involves a destructive reformat, and most users
can't be bothered (at least initially). Another reason is user
ignorance of the numerous technical advantages of using HPFS.
<P>
This month we start a series that delves into the structures that make
up OS/2's HPFS. It is very difficult to get any public information on
it aside from what appeared in an article written by Ray Duncan in the
September '89 issue of Microsoft Systems Journal, Vol 4 No 5. I suspect
that the IBM-Microsoft marriage break-up that occurred in 1991 may have
caused an embargo on further HPFS information. I've been searching
books and the Internet for more than a year looking for information with
very little success. You usually end up finding a superficial
description without any detailed discussion of the internal layout of
its structures.
<P>
There are three commercial utilities that I've found very useful. SEDIT
from the GammaTech Utilities v3 is a wonder. It decodes quite a bit of
the information in HPFS' structures. HPFSINFO and HPFSVIEW from the
Graham Utilities are also good. HPFSINFO lists information gleaned from
HPFS' SuperBlock and SpareBlock sectors, while HPFSVIEW provides the
best visual display I've seen of the layout of a HPFS partition. You
can receive some information on a sector by clicking on it. HPFSVIEW is
also freely available in the demo version of the Graham Utilities,
GULITE.xxx. I've also written a REXX program to assist with
cross-referencing locations between SEDIT & HPFSVIEW and to provide a
convenient means of dumping a sector.
<P>
Probably the most useful program around at the moment is freeware,
FST03F.xxx (File System Tool) written by Eberhard Mattes. This provides
lots of information and comes with source. Even if you aren't a C
programmer (I'm not) you can learn much from its definition of
structures. Unfortunately I wrote the first three instalments without
seeing this information so that made the task more difficult.
<P>
In the early stages I've had to employ a very laborious process in an
attempt to learn more. I created the smallest OS/2 HPFS partition
possible (1 MB). Then I created/altered a file or directory and
compared the changes. Sometimes I knew where the changes would occur so
I could just compare the two sectors but often I ended up comparing two
1 MB image files looking for differences and then translating the location
in the image into C/H/S (a physical address in Cylinder/Head/Sector
format) or LSN (Logical Sector Number). While more information will be
presented in this series than I've seen in the public domain, there are
still things that I've been unable to decipher.
<P>
<H2>The Win95 Fizzer</H2>
<P>
For me, the most disappointing feature of Win 95 is the preservation of
the FAT (File Allocation Table) system. It's now known as VFAT but
aside from integrated 32-bit file and disk access, the structure on the
disk is basically the same as DOS v4 (circa 1988). An ungainly method
involving the volume label file attribute was used to graft long
filename support onto the file system. These engineering compromises
were made to most easily achieve backward compatibility. It's a pity
because Microsoft has an excellent file system available in NT, namely
NTFS. This file system is very robust although perhaps NTFS is overkill
for the small user.
<P>
The Program Manager graphical user interface (GUI) appeared in OS/2 v1.1
in 1988. The sophisticated High-Performance File System came with OS/2
v1.2 which was released way back in 1989! The powerful REXX scripting
language showed up in OS/2 v1.3 (1991). And the largely
object-orientated WPS (Work Place Shell) GUI appeared in 1992 in OS/2
v2.0. So it is hardly surprising that experienced OS/2 users were not
swept up in the general hysteria about Windows 95 being the latest and
greatest.
<P>
A positive aspect of the Win 95 craze has been that the minimum system
requirement of 8 MB RAM, 486/33 makes a good platform for OS/2 Warp. So
now the disgruntled Win 95 user will find switching OSs less daunting,
at least from a hardware viewpoint.
<P>
<H2>Dual Boot and Boot Manager</H2>
<P>
I've never used Dual Boot because it seems so limiting. I've always
reformatted and installed Boot manager so that I could select from up to
four different Operating Systems, for example OS/2 v2.1, OS/2 Warp
Connect (peer-to-peer networking with TCP/IP and Internet support), IBM
DOS v7 and Linux.
<P>
In previous OS/2 installations, I've left a small (50 MB) FAT partition
that could be seen when I booted under either DOS or OS/2, while the
rest of the HD space (910 MB) was formatted as HPFS. Recently I
upgraded to Warp Connect and this time I dropped FAT and the separate
DOS boot partition completely. This does not mean I am unable to run
DOS programs. OS/2 has inbuilt IBM DOS v5 and you can install boot
images of other versions of DOS, or even CP/M, for near instantaneous
booting of these versions. There is no reason why you can't have
multiple flavours of DOS running at the same time as you're running
multiple OS/2 sessions. Furthermore DOS programs have no problems
reading from, writing to or running programs on HPFS partitions even
though the layout is nothing like FAT. It's all handled transparently
by OS/2. But this does mean you have to have booted OS/2 first. HPFS
is not visible if you use either Dual Boot or Boot Manager to boot
directly to DOS, but there are a number of shareware programs around to
allow read-access to HPFS drives from DOS.
<P>
DOS uses the system BIOS to access the hard disk. This is limited to
dealing with a HD that has no more than 1,024 cylinders due to 10 bits
(2^10 = 1,024) being used in the BIOS for cylinder numbering. OS/2 uses
the system BIOS at boot time but then completely replaces it in memory
with a special Advanced BIOS. This means that the boot partition and,
if you use it, Boot Manager's 1 MB partition, must be within the first
1,024 cylinders. Once you've booted OS/2, however, you can access
partitions on cylinders past the Cyl 1023 point (counting from zero)
without having to worry about LBA (Logical Block Addressing) translation
schemes.
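<P>
As a rough sketch of where that ceiling comes from (assuming the classic
textbook INT 13h register layout: a 10-bit cylinder field, an 8-bit head
field and a 6-bit sector field with numbering starting at 1 - values not
taken from this article):

```python
# CHS capacity under the classic BIOS limits (assumed textbook values:
# 10-bit cylinder, 8-bit head, 6-bit sector fields; sectors number
# from 1, so only 63 are usable; 512-byte sectors).
SECTOR_SIZE = 512

def chs_capacity_bytes(cylinders, heads, sectors_per_track):
    """Bytes addressable with a given C/H/S geometry."""
    return cylinders * heads * sectors_per_track * SECTOR_SIZE

limit = chs_capacity_bytes(2**10, 2**8, 2**6 - 1)
print(limit // 2**20, "MB")            # 8064 MB with a maximal geometry

# The article's third HD: only the first 1,024 of its 1,335 cylinders
# were reachable through the BIOS.
print(f"visible: {1024 / 1335:.0%}")   # about 77%
```

With the 16-head geometry more typical of the era, the same formula gives
the familiar 504 MB DOS limit (1,024 * 16 * 63 * 512 bytes).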
<P>
Now this can still catch you out if you boot DOS. On my old system I'd
sometimes use Boot Manager to boot a native DOS version. I'd load AMOS
(a shareware program) to see the HPFS drives. I thought there must have
been a bug in AMOS because I could only see half of F: and none of G:
until I realised that these partitions were situated on a third HD that
had 1,335 cylinders. So this was just the effect of DOS' 1,024 cylinder
limitation which the AMOS program was unable to circumvent.
<P>
<H2>Differences between an Easy and an Advanced Installation</H2>
<P>
Most new OS/2 users select the "Easy Installation" option. This is
satisfactory but it only utilises FAT, installs OS/2 on the same drive
as DOS and Windows, does not reformat the partition and Dual Boot is
installed.
<P>
If you know what you're doing or are more aggressive in wanting to take
advantage of what OS/2 can provide then the "Advanced Installation"
option is for you. Selecting it enables you to selectively install
parts of OS/2, install OS/2 in a primary or logical (extended) partition
other than C: or even on a 2nd HD (I don't know whether you can install
on higher physical drives than the 2nd one in a SCSI multi-drive setup);
the option of installing Boot Manager is provided; you can use HPFS if
you wish; installation can occur on a blank HD.
<P>
<H2>FAT vs HPFS: If Something Goes Wrong</H2>
<P>
CHKDSK on a HPFS partition can recover from much more severe faults than
it can on a FAT system. This is because the cluster linkages in a FAT
system are one-way, pointing to the next cluster in the chain. If the
link is broken it is usually impossible to work out where the lost
clusters ("x lost clusters in y chains") should be reattached. Often
they are just artifacts of a program's use of temporary files that
haven't been cleaned up properly. But "file truncated" and
"cross-linked files" messages are usually an indication of more serious
FAT problems.
<P>
HPFS uses double linking: the allocation block of a directory or file
points back to its predecessor ("parent") as well as to the next element
("child"). Moreover, major structures contain dword (32-bit) signatures
identifying their role and each file/directory's FNODE contains the
first 15 characters of its name. So blind scanning can be performed by
CHKDSK or other utilities to rebuild much of the system after a
significant problem.
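<P>
The blind-scanning idea can be sketched in a few lines: walk a disk image
sector by sector and collect everything whose first dword matches a known
structure signature. The signature values below are invented placeholders,
not HPFS' real magic numbers:

```python
# Sketch of "blind scanning" a disk image for structure signatures, as
# a CHKDSK-style recovery tool might do.
import struct

FNODE_SIG  = 0xDEADBEEF   # hypothetical placeholder, not the real magic
DIRBLK_SIG = 0xCAFEBABE   # hypothetical placeholder

SECTOR = 512

def scan_for_signatures(image, signatures):
    """Return (sector_number, signature) for every sector whose first
    little-endian dword matches a known structure signature."""
    hits = []
    for lsn in range(len(image) // SECTOR):
        sig = struct.unpack_from("<I", image, lsn * SECTOR)[0]
        if sig in signatures:
            hits.append((lsn, sig))
    return hits

# Tiny fabricated "disk": sector 3 starts with the FNODE-like signature.
disk = bytearray(8 * SECTOR)
struct.pack_into("<I", disk, 3 * SECTOR, FNODE_SIG)
print(scan_for_signatures(disk, {FNODE_SIG, DIRBLK_SIG}))   # [(3, 3735928559)]
```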
<P>
As a personal comment, I've been using HPFS since April, 1993, and I've
yet to experience any serious file system problems. I've had many OS/2
lockups while downloading with a DOS comms program and until recently
I was running a 4 MB hardware disk cache with delayed writes, yet,
aside from the lost download file, the file system has not been
permanently corrupted.
<P>
<H2>Warp, FORMAT /FS:HPFS, CHKDSK /F:3 and The Lazarus Effect</H2>
<P>
Warp, by default, does a quick format when you format a HD under either
FAT or HPFS. So FORMAT /FS:HPFS x:, which is what the installation
program performs if you decide to format the disk with HPFS, is
performed very quickly. It's almost instantaneous if you decide to
reformat with FAT (/FS:FAT). Now this speed differential does not mean
that FAT is much quicker, only that FORMAT has very little work to
perform during a quick FAT reformat since the FAT structures are so
simple compared to HPFS.
<P>
As mentioned earlier, CHKDSK has extended recovery abilities when
dealing with HPFS. It has four levels of /F:n checking/recovery. These
will be considered in greater detail in a later article in this series
when we look at fault tolerance. The default of CHKDSK /F is equivalent
to using /F:2. If you decide to use /F:3 then CHKDSK will dig deep and
recover information that existed on the partition prior to the
reformatting providing that it was previously formatted as HPFS. Using
CHKDSK /F:3 after performing a quick format on a partition that was
previously FAT but is now HPFS will not cause this, since none of the
previous data has HPFS signature words embedded at the beginning of its
sectors. However, if you ever use /F:3 after quickly reformatting a
HPFS partition you could end up with a bit of a mess since everything
would be recovered that existed on the old partition and which hadn't
been overwritten by the current contents.
<P>
To guard against this, OS/2 stores whether or not a quick format has
been performed on a HPFS partition in bit 5 (counting from zero) of byte
08h in LSN (Logical Sector Number) 17, the SpareBlock sector. This
particular byte is known as the Partition Status byte, with 20h
indicating that a quick format was performed. Bit 0 of this byte is
also used to indicate whether the partition is "clean" or "dirty" so 21h
indicates that the partition was quick formatted and is currently
"dirty" (these concepts will be covered in a later instalment).
<P>
If you attempt to perform a CHKDSK /F:3 on a quick-formatted partition,
you will receive the following warning:
<PRE>
SYS0641: Using CHKDSK /F:3 on this drive may cause files that existed
before the last FORMAT to be recovered. Proceed with CHKDSK (Y/N)?
</PRE>
<P>
If you type "HELP 641" for further information you'll see:
<PRE>
EXPLANATION: The target drive was formatted in "fast format" mode,
which does not erase all data areas. CHKDSK /F:3 searches data areas
for "lost" files. If a file existed on this drive before the last
format, CHKDSK may find it, and attempt to recover it.

ACTION: Use CHKDSK /F:2 to check this drive. If you use /F:3, be aware
that files recovered to the FOUND directories may be old files. Also,
if you format a drive using FORMAT /L, FORMAT will completely erase all
old files, and avoid this warning.
</PRE>
<P>
It seems a pity to forego the power of the CHKDSK /F:3 in the future.
As is suggested, FORMAT /L (for "Long" I presume) will completely
obliterate the prior partition's contents, but you can't specify this
during a reinstall. To perform it you need to use FORMAT /L on the
partition before reinstalling. For this to be practical you will
probably need to keep OS/2 and nothing else on a separate partition and
to have a recent tape backup of the remaining volumes' contents. Note:
in my opinion keeping OS/2 on a separate partition is the best way of
laying out a system but make sure you leave enough room for things like
extra postscript fonts and programs that insist on putting things on C:.
<P>
<H2>Capacity</H2>
<P>
Figure 1 shows a table comparing the capacity of OS/2's FAT and HPFS
file systems. The difference in the logical drive numbers arises due to
A: and B: being assigned to floppies which are always FAT. It would
be ridiculous to put a complex, relatively large file system, which was
designed to overcome FAT's limitations with big partitions, on volumes
as small as current FDs.
<PRE>
                      FAT            HPFS
Logical drives        26             24
Num of Partitions     16             16
Max Partition Size    2 GB           64 GB
Max File Size         2 GB           2 GB
Sector Size           512 bytes      512 bytes
Cluster/Block Size    0.5 KB-32 KB   512 bytes
</PRE>
<FONT SIZE=2>
Fig.1 Comparing the capacity of FAT and HPFS
</FONT>
<P>
The next point of interest is the much greater partition size supported by HPFS.
HPFS has a maximum possible partition size of about 2,200 GB (2^32 sectors) but
is restricted in the current implementation to 64 GB. (Note: older references
state that the maximum is 512 GB.) I don't know what imposes this limitation.
Note: the effective limitation on partition size is currently around 8 GB.
This is due to CHKDSK's inability to handle a larger partition. I presume this
limitation will be rectified soon, as ultra-large HDs will become common in the
next year or two.
<P>
The 2 GB maximum filesize limit is common to DOS, OS/2 and 32-bit Unix. A
32-bit file size should be able to span a range of 4 GB (2^32) but the
DosSetFilePtr API function requires that the highest bit be used for indicating
sign (forward or backward direction of movement), leaving 31 for size.
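<P>
Both limits fall straight out of the field widths, as this quick check
shows:

```python
# The two size limits above: 32-bit sector numbers bound the partition,
# and the signed 32-bit DosSetFilePtr offset (1 sign bit + 31 magnitude
# bits) bounds an individual file.
SECTOR_SIZE = 512

max_partition_bytes = (2 ** 32) * SECTOR_SIZE   # 2^41 bytes, ~2,200 GB
max_file_bytes = 2 ** 31 - 1                    # just under 2 GB

print(max_partition_bytes)   # 2199023255552
print(max_file_bytes)        # 2147483647
```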
<P>
The cluster size on a 1.4 MB FD is 512 bytes. For a 100 MB HD formatted
with FAT it is 2 KB. Due to the relatively small 64K (2^16) limit on
cluster numbering, as FAT partitions get bigger the size of clusters
must also increase. So for a 1-2 GB partition you end up with whopping
32 KB clusters. Since the average wastage of HD space due to the
cluster size is half a cluster per file, storing 10,000 files on such a
partition will typically waste 160 MB (10,000 * 32 KB / 2).
<P>
HPFS has no such limitation. File space is allocated in sector-sized
blocks unlike the FAT system. A FNODE sector is also always associated
with each file. So for 10,000 files, the wastage due to sector size is
typically 2.5 MB (10,000 * 512 / 2) for the files themselves + 5 MB
consumed by the file's FNODEs = 7.5 MB. And this overhead is constant
whether the HPFS partition is 10 MB or 100 GB.
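<P>
The wastage figures quoted above reduce to a couple of lines of
arithmetic (the "MB" labels in the text use the loose 1 MB = 1,000 KB
convention):

```python
# Average per-file wastage: FAT loses half a cluster of slack per file;
# HPFS loses half a sector of slack plus one whole FNODE sector per file.
KB = 1024

def fat_waste_bytes(n_files, cluster_bytes):
    return n_files * cluster_bytes // 2

def hpfs_waste_bytes(n_files, sector=512):
    return n_files * sector // 2 + n_files * sector   # slack + FNODEs

print(fat_waste_bytes(10_000, 32 * KB) // KB)   # 160000 KB, the "160 MB"
print(hpfs_waste_bytes(10_000) // KB)           # 7500 KB, the "7.5 MB"
```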
<P>
This must be balanced against the diskspace consumed by HPFS. Since
HPFS is a sophisticated file system that is designed to accomplish a lot
more than FAT, it correspondingly requires more diskspace than FAT.
Figure 2 illustrates this. You may think that 10 MB for the file system
is too much for a 1,000 MB partition but you should consider this as a
percentage.
<PRE>
           System Usage including   Disk Space available   Allocation Unit
           MBR track                to user                + Fnode for HPFS
           FAT/HPFS in KB           FAT/HPFS in %          FAT/HPFS in KB
10 MB      44/415                   99.57/95.95            4/0.5+0.5
100 MB     76/3,195                 99.77/96.88            2/0.5+0.5
1000 MB    289(est)/10,430          99.98(est)/98.98       16/0.5+0.5
</PRE>
<FONT SIZE=2>
Fig. 2: Space used by FAT and HPFS on different volumes
</FONT>
<P>
Furthermore, once cluster size wastage is also considered, then the
break-even point (as regards diskspace) for a 1,000 MB partition is
about 2,200 files, which isn't very many. This is based on a 16 KB
cluster size. In the 1,024-2,047 MB partition size range the cluster
size increases to 32 KB so the "crossover" point shifts to only 1,100
files.
<P>
I had to calculate the 1,000 MB FAT partition values since OS/2 wouldn't
let me have a FAT partition situated in the greater than Cyl 1023
region. The 4 KB cluster size of the 10 MB partition is not a misprint.
Below 16 MB, a 12-bit FAT scheme (1.5 bytes in the FAT representing 1
cluster) is used instead of a 16-bit one.
<P>
<H2>Directory Search Speed</H2>
<P>
Consider an extreme case: FAT system on a full partition which has a
maximum-sized FAT (64K entries - this is the maximum number of files a
FAT disk can hold). The size of such a partition would be 128 MB, 256
MB, 512 MB, 1 GB or 2 GB, depending on cluster size. Each FAT is 128 KB
in size. (There is a second FAT which mirrors the first.) In this
example all the files are in one subdirectory. This can't be in the
root directory because it only has space for 512 entries. (With HPFS
you can have as many files as you want in the root directory.) 64 K of
entries in a FAT directory requires 2 MB of diskspace (64K * 32
bytes/directory entry). To find a file, on average, 32 K directory
entries would need to be searched. To say that a file was not on the
disk, the full 64 K entries must be scanned before the "File not found"
message was shown. The same figures would apply if you were using a
file-finding utility to look for a file in 1,024 directories, each
containing 63 files (the subdirectory entry also consumes space).
<P>
If the directory entries were always sorted, the situation would greatly
improve. Assuming you had a quick means of getting to the file in the
sorted sequence, if it's the file you're looking for then you've found
its directory entry (and thus its starting cluster's address). If a
file greater in the sequence than the required file is found instead
then you immediately know that the file does not exist.
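<P>
That early-exit property is easy to demonstrate with a generic binary
search over sorted names (an illustration in Python, not HPFS' actual
lookup code): the search either lands on the entry or on the first name
greater than it, at which point the file is known to be absent.

```python
# Lookup in a sorted name list: find the entry, or prove absence as
# soon as the first greater name is reached.
from bisect import bisect_left

def lookup(sorted_names, target):
    i = bisect_left(sorted_names, target)
    if i < len(sorted_names) and sorted_names[i] == target:
        return i          # found: this entry would hold the metadata
    return None           # first greater name reached: not on disk

names = sorted(["autoexec.bat", "config.sys", "readme.txt"])
print(lookup(names, "config.sys"))   # 1
print(lookup(names, "os2krnl"))      # None
```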
<P>
HPFS stores directory files in a balanced multi-branch tree structure
(B-tree) which is always sorted due to the way the branches are
assigned. This can lead to some extra HD activity, caused by adjustment
of the tree structure, when a new file is added or a file is renamed.
This is done to keep the tree balanced i.e. the total length of each
branch from the root to the leaves is the same. The extra work when
writing to the disk is hidden from the user by the use of "lazy writes"
(delayed write caching).
<P>
HPFS directory entries are stored in contiguous directory blocks of four
sectors i.e. 2 KB known as DIRBLKs. A lot of information is stored in
each variable-length (unlike FAT) file entry in a DIRBLK structure,
namely:
<UL>
<LI>The length of the entry;
<LI>File attributes;
<LI>A pointer to the HPFS structure (FNODE; usually just before the
first sector of a file) that describes the sector disposition of the
file;
<LI>Three different date/time stamps (Created, Last Accessed, Last
Modified);
<LI>Usage count. Although mentioned in the 1989 document, this appears
not to have been implemented;
<LI>The length of the name (up to 254 characters);
<LI>A B-tree pointer to the next level of the tree structure if there
are any further levels. The pointer will be to another directory
block if the directory entries are too numerous to fit in one 2 KB
block;
</UL>
<P>
At the end of the sector there is extra ("flex") space available for
special purposes.
<P>
If the average size of the filenames is 10-13 characters, then a
directory block can store 44 of them (11 entries/sector). A two-level
B-tree arrangement can store 1,980 entries (1 * 44-entry directory root
block + 44 directory leaf blocks * 44 entries/block) while a three-level
structure could accommodate 87,164 files (the number of files in the
two-level tree + 1,936 third-level directory leaf blocks * 44
entries/block). So the 64 K of directory entries in our example can be
searched in a maximum of 3 "hits" (disk accesses). The term "maximum"
was used because it depends on what level the filename in question is
stored in the B-tree structure and what's in the disk cache.
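<P>
The capacity arithmetic above generalises to any depth (assuming, as in
the text, 44 entries per 2 KB DIRBLK):

```python
# Total directory entries a balanced tree of a given depth can hold,
# with every block (root, internal and leaf) holding 44 entries.
def btree_capacity(levels, entries_per_block=44):
    total, blocks = 0, 1
    for _ in range(levels):
        total += blocks * entries_per_block
        blocks *= entries_per_block
    return total

print(btree_capacity(2))   # 1980  (44 + 44*44)
print(btree_capacity(3))   # 87164 (1980 + 1936*44)
```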
<P>
Adding files to a directory containing many files (say 500+) under FAT
becomes an exasperating affair. I've often experienced this because a
DOS program we've installed on hundreds of our customers' machines has
648 files in a sub-sub-subdirectory. Watching the archive unpack on a
machine without disk caching is bad news and it still slows down
noticeably on machines with large SMARTDRIVE caches.
<P>
Figure 3 shows a simple REXX program you can create to investigate this
phenomenon while Figure 4 tables some results. The program creates a
large number of zero-length files in a directory. Perform this test in
a subdirectory to overcome FAT's restriction on a maximum of 512 entries
in the root directory. Reformatting and rebooting were performed before
each test to ensure consistent conditions. With both FAT and HPFS, a
1,536 KB lazy-writing cache with a maximum cacheable read/write size of
8 KB was used. Note 1: with HPFS, a "zero-length" file consumes
diskspace because there is always a FNODE sector associated with a
file/directory, regardless of the file's contents. So 1,000 empty files
consume 500 KB of space. Note 2: there is a timing slop of about 0.1
seconds due to the 55 msec timer tick uncertainty affecting both the
start time and stop time values.
<PRE>
/* Create or open a large number of empty files in a directory */
CALL Time 'R'                            /* Reset timer */
DO x = 1 TO 1000
  CALL STREAM 'file'||x, 'c', 'open'     /* Will create if not exist */
  CALL STREAM 'file'||x, 'c', 'close'
END
SAY Time('E')                            /* Report elapsed time */
</PRE>
<FONT SIZE=2>
Fig 3: A REXX program to assess the directory searching and file
creation speeds of FAT and HPFS.
</FONT>
<PRE>
                     Number of Files in a Directory
            125    250    500   1000   2000    4000   4001->4100
FAT         1.7    3.4    8.0   23.4   99.4   468.4     26.6
FAT (LW)    0.7    1.7    5.1   17.9   89.6   447.3     26.1
HPFS        7.4   14.7   30.7   62.9  129.0   262.6      7.5
HPFS (LW)   0.5    1.0    2.2    4.5    9.0    18.3      0.5
</PRE>
<FONT SIZE=2>
Fig 4: Timing results of the program in Figure 3. The beneficial effect
of lazy writing on performance is clearly demonstrated. Tests were
performed in an initially empty subdirectory except for the last one
which adds 100 new files to a subdirectory already containing 4,000
files.
</FONT>
<P>
To investigate further, the full data set was plotted on a graph with
logarithmic axes. Examine Figure 5. As you can see, HPFS' performance
is reasonably linear (in y = a*x^b + c, b was actually 1.1) while FAT's
performance appears to follow a third-order polynomial (y = a*x^3 +
b*x^2 + c*x + d). It is apparent that FAT's write caching becomes less
effective when many files are in a directory presumably because much
time is being spent sifting through the FAT in memory. (Disk access was
only occurring briefly about once a second based on the flashing of the
HD light). HPFS' performance was dramatically improved in this test by
the use of write caching. Again, disk access was about once a second
(due to CACHE's /MAXAGE:1000 parameter). While, typically, most disk
access will involve reading rather than writing, this graph shows how
effective lazy writing is at hiding the extra work from the user. It is
also apparent that HPFS handles large numbers of files well. We now
turn to examining how this improvement is achieved.
<P>
<A HREF="fig5.gif">
<IMG WIDTH=100 HEIGHT=57 SRC="fig5_small.gif"></A>
<P>
<FONT SIZE=2>
Fig. 5: Log-log graph comparing file system performance creating test
files in a subdirectory. Extra data points shown. Number of files was
increased using a cube-root-of-2 multiple. (Click for large version.)
</FONT>
<P>
<H2>Directory Location and Fragmentation</H2>
<P>
Subdirectories on a FAT disk are usually splattered all around it.
Similarly, entries in a subdirectory may not all be in contiguous
sectors on the disk. Searching a FAT system's directory structure can
involve a large amount of HD seeking back and forth, i.e. more time.
Sure, you can use a defragger option to move all the directories to the
front of the disk, but this usually takes a lot of time to reshuffle
everything and the next time you create a new subdirectory or add files
to an existing subdirectory there will be no free space up the front so
directory separation and fragmentation will occur again.
<P>
HPFS takes a much better approach. On typical partitions (i.e. not
very small ones) a directory band, containing many DIRBLKs, is placed at
or near the seek centre (half the maximum cylinder number). On a 100 MB
test partition the directory band starts at Cyl 48 (counting from 0) of
a volume that spans 100 cylinders. Here 1,980 contiguous Directory
sectors (just under 1 MB) were situated. Assuming 11 entries per
Directory sector (44 entries per DIRBLK), this means that the first
21,780 directory entries will be next to each other. So if a blind file
search needs to be performed this can be done with just 1 or 2 long disk
reads (assuming &lt;20,000 files and 1-2 MB disk cache). The maximum
size of the contiguous directory band appears to be 8,000 KB for about
176,000 entries with 13-character names. Once the directory band is
completely full new Directory sectors are scattered throughout the
partition but still in four-sector DIRBLKs.
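<P>
The band figures quoted above are simple arithmetic (using the earlier
assumption of 11 entries per 512-byte directory sector for ~13-character
names):

```python
# Directory-band capacity from the figures in the text.
ENTRIES_PER_SECTOR = 11       # assumes ~13-character names

contiguous_sectors = 1_980    # observed on the 100 MB test partition
print(contiguous_sectors * ENTRIES_PER_SECTOR)   # 21780 adjacent entries

band_kb = 8_000               # apparent maximum directory-band size
band_sectors = band_kb * 1024 // 512
print(band_sectors * ENTRIES_PER_SECTOR)         # 176000 entries
```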
<P>
Another important aspect of HPFS' directory band is its location. By
being situated near the seek centre rather than at the very beginning
(as in FAT), the average distance that the heads must traverse, when
moving between files and directories, is halved. Of course, with lazy
writing, traversals to frequently update a directory entry while writing
to a temporary file would be much reduced anyway.
<P>
<H2>File Location and Fragmentation</H2>
<P>
HPFS expends a lot of effort to keep a file either in one piece if
possible or otherwise within a minimum number of pieces and close
together on the disk so it can be retrieved in the minimum number of
reads (remembering also that cache read-ahead can take in more than one
nearby piece in the same read). Also, the seek distance, and hence time
required to access extra pieces, is kept to an absolute minimum. The
main design philosophy of HPFS is that mechanical head movement is a
very time-consuming operation in CPU terms. So it is worthwhile doing
more work looking for a good spot on the disk to place the file. There
are many aspects to this and I'm sure there are plenty of nuances of
which I'm ignorant.
<P>
Files are stored in 8 MB contiguous runs of sectors known as data bands.
Each data band has a four-sector (2 KB) freespace bitmap situated at
either the band's beginning or end. Consecutive data bands have
tail-to-head placement of the freespace bitmaps so that maximum
contiguous filespace is 16 MB (actually 16,380 KB due to the presence of
the bitmaps within the adjoining band). See Figure 6.
<P>
<IMG WIDTH=403 HEIGHT=213 SRC="fig6.gif">
<P>
<FONT SIZE=2>
Fig. 6: The basic data layout of an HPFS volume
</FONT>
<P>
Near the start of the partition there is a list of the sectors where
each of the freespace bitmaps commences. I'm sure that this small list
would be kept loaded into memory for performance reasons. Having two
small back-to-back bitmaps adjoining a combined 16 MB data band is
advantageous when HPFS is looking for the size of each freespace region
within bands, prior to allocating a large file. But it does mean that a
fair number of seeks to different bitmaps might need to be performed on
a well-filled disk, in search of a contiguous space. Or perhaps these
bitmaps are also kept memory resident if the disk is not too big.
<P>
A 2 GB file would be split into approximately 128 chunks of 16 MB, but
these chunks are right after each other (allowing for the presence of
the intervening 4 KB of back-to-back freespace bitmaps). So to refer to
this file as "fragmented", while technically correct, would be
misleading.
<P>
As mentioned earlier, every file has an associated FNODE, usually right
before the start of the file. The pieces a file is stored in are
referred to as extents. A "zero-length" file has 0 extents; a
contiguous file has 1 extent; a file of 2-8 extents is "nearly"
contiguous (the extents should be close together).
<P>
An FNODE sector contains:
<UL>
<LI>The real filename length;
<LI>The first 15 characters of the filename;
<LI>Pointer to the directory LSN that contains this file;
<LI>EAs (Extended Attributes) are completely stored within the FNODE
structure if the total of the EAs is 145 bytes or less;
<LI>0-8 contiguous sector runs (extents), organised as eight LSN
run-starting-points (dword), run lengths (dword) and offsets into
the file (dword).
</UL>
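<P>
A simplified model of how such a run list resolves a byte offset in a
file to a disk sector (the field layout is abstracted, and the LSNs and
run sizes below are invented purely for illustration):

```python
# Map a file byte offset to the Logical Sector Number that holds it,
# given a list of contiguous sector runs (extents).
from dataclasses import dataclass

@dataclass
class Extent:
    file_sector: int   # offset into the file, in sectors
    start_lsn: int     # first Logical Sector Number of the run
    length: int        # run length in sectors

def lsn_for_offset(extents, byte_offset, sector=512):
    fsec = byte_offset // sector
    for e in extents:
        if e.file_sector <= fsec < e.file_sector + e.length:
            return e.start_lsn + (fsec - e.file_sector)
    raise ValueError("offset beyond end of file")

# A file stored in 2 extents: 100 sectors at LSN 5,000, 50 at LSN 9,000.
runs = [Extent(0, 5_000, 100), Extent(100, 9_000, 50)]
print(lsn_for_offset(runs, 0))          # 5000
print(lsn_for_offset(runs, 100 * 512))  # 9000, first sector of 2nd run
```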
<P>
A run can be up to 16 MB (back-to-back data bands) in size. If the file
is too big or more fragmented than can be described in 8 extents, then
an ALNODE (allocation block) is pointed to from the FNODE. In this case
the FNODE structure changes so that it now contains up to 12 ALNODE
pointers within the FNODE and each ALNODE can then point to either 40
direct sector runs (extents) or to 60 further ALNODEs, and each of these
lower-level ALNODEs could point to either... and so on.
<P>
If ALNODEs are involved then a modified balanced tree structure called a
B+tree is used with the file's FNODE forming the root of the structure.
So only a two-level B+tree would be required to completely describe a 2
GB (or smaller) file if it consists of less than 480 runs (12 ALNODEs *
40 direct runs described in each ALNODE). Otherwise a 3-level structure
would have no problems since it can handle up to 28,800 runs (12 ALNODEs
* 60 further ALNODEs * 40 direct runs). It's difficult to imagine a
situation where a four or higher level B+tree would ever be needed.
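<P>
The run-capacity figures follow directly from the fan-out numbers given
above:

```python
# Runs describable by the FNODE/ALNODE structure: 12 ALNODE pointers per
# FNODE, each ALNODE holding either 40 direct runs or 60 pointers to
# lower-level ALNODEs.
FNODE_ALNODES   = 12
RUNS_PER_ALNODE = 40
CHILD_ALNODES   = 60

two_level   = FNODE_ALNODES * RUNS_PER_ALNODE                  # 480
three_level = FNODE_ALNODES * CHILD_ALNODES * RUNS_PER_ALNODE  # 28800
print(two_level, three_level)
```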
<P>
Consider how much disk activity would be required to work out the layout
of a 2 GB file under FAT and under HPFS. With FAT the full 128 KB of
the FAT must be read to determine the file's layout. If this layout can
be kept in the cache during the file access then fine. Otherwise the
FAT would need to be reread one or more times (probably starting from
the beginning on each reread). With HPFS, up to 361 sector reads, in a
three-level B+tree structure, and possibly up to just 13 sector reads,
in a two-level structure, would provide the information. The HPFS
figures are maximums and the actual sector-read figure would most
probably be much lower since HPFS was trying hard to reduce the number
of runs when the file was written. Also if the ALNODEs are near each
other then read-ahead would reduce the actual hits. Furthermore, OS/2
will keep the file's allocation information resident in memory while the
file is open, so no rereads would be needed.
<P>
If you've ever looked at the layout of files on a HPFS partition, you
may have been shocked to see the large gaps in the disk usage. This is
FAT-coloured thinking. There are good reasons not to use the first
available spot next to an existing file, particularly in a multitasking
environment where more than one write operation can be occurring
concurrently. HPFS uses three strategies here that I'm aware of.
First, the destination of write operations involving new files will tend
not to be near (preferably in a different band from) where an existing
file is also being updated. Otherwise, fragmentation would be highly
likely to occur.
<P>
Second, 4 KB of extra space is allocated by the file system to the end
of a file when it is created. Again the reason is to reduce the
likelihood of fragmentation from other concurrent writing tasks.
If not utilised, this space is recovered afterwards. To test this
assertion, create the REXX cmdfile shown in Figure 7 and run it on an
empty HPFS partition. (You can also do this on a partition with files
in it but it is easier on an empty one.) Run it and when the "Press any
key" message appears start up another OS/2 session and run CHKDSK (no
switches) on the partition under examination. CHKDSK will get confused
about the space allotted to the file open in the other session and will
say it is correcting an allocation error (which it really isn't doing
because you did not use the /F switch). Ignore this and notice that "4
kilobytes are in 1 user files". Switch back to the other session and
press Enter to close the file. Repeat and again run CHKDSK in the other
session. Notice this time that no extra space is allocated since the
file is being reopened rather than being created.
<PRE>
/* Test to check the space
preallocated to an open file */
CALL STREAM 'zerofile', 'c', 'open'
/* Will create if it does not exist */
'@pause'
CALL STREAM 'zerofile', 'c', 'close'
</PRE>
<FONT SIZE=2>
Fig. 7: A simple REXX program to demonstrate how HPFS allocates 4 KB of
diskspace to a new file.
</FONT>
<P>
Third, if a program has been written to report the likely filesize to
OS/2, or if you are copying an existing file (i.e. the final filesize
is known) then HPFS will expend a great deal of effort to find a free
space big enough to accommodate the file in one extent. If that is not
possible then it looks for two free spaces half the size of the file and
so on. Again this can result in two files in a directory not being next
to each other on the disk.
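<P>
That halving search can be sketched as follows. This is my illustrative reconstruction of the strategy described above, not IBM's actual allocator; the function name and its return convention are invented for the example:

```python
def plan_allocation(free_runs, size):
    """Illustrative reconstruction of the search described above: try to
    place the file as 1 extent, then 2 half-size extents, then 4, and so
    on.  free_runs is a list of free-run lengths in sectors; returns a
    (pieces, piece_size) pair that fits, or None."""
    pieces = 1
    while pieces <= len(free_runs):
        piece_size = -(-size // pieces)          # ceiling division
        big_enough = [r for r in free_runs if r >= piece_size]
        if len(big_enough) >= pieces:
            return pieces, piece_size
        pieces *= 2
    return None

# A 1000-sector file with free runs of 600, 700 and 120 sectors cannot
# fit in one extent, but fits as two 500-sector pieces:
print(plan_allocation([600, 700, 120], 1000))  # (2, 500)
```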
<P>
Since DOS and Windows programs are not written with filesize
preallocation in mind, they tend to be more likely candidates for
fragmentation than properly written OS/2 programs. So, for example,
using a DOS comms program to download a large file will often result in
a fragmented file. Compared with FAT, though, fragmentation on heavily
used HPFS volumes is very low, usually less than 1%. We'll consider
fragmentation levels in more depth in Part 3.
<P>
<H2>Other Matters</H2>
<P>
It has also been written that the HPFS cache is smart enough to adjust
the value of its sector read-ahead for each opened file based on the
file's usage history or its type (Ray Duncan, 1989). It is claimed that
EXE files and files that typically have been fully read in the past are
given big read-aheads when next loaded. This is a fascinating concept
but unfortunately it has not been implemented.
<P>
Surprisingly, like other device drivers, HPFS is still 16-bit code. I
think this is one of the few remaining areas of 16-bit code in Warp. I
believe IBM's argument is that 32-bit code here would not help
performance much as mechanical factors are the ones imposing the limits,
at least in typical single-user scenarios.
<P>
HPFS is run as a ring 3 task in the 80x86 processor protection mechanism
i.e. at the application level. HPFS386 is a 32-bit version of HPFS
that comes only with IBM LAN SERVER Advanced Version. HPFS386 runs in
ring 0, i.e. at kernel level. This ensures the highest file system
performance in demanding network situations. It can also provide much
bigger caches than standard HPFS which is limited to 2 MB. There is a
chance that this version will appear in a later release of Warp.
<P>
OS/2 v2.x onwards also boosts the performance of FAT. This improvement,
called "Super FAT", is a combination of 32-bit executable code and the
mirroring of the FAT and directory paths in RAM. This requires a fair
bit of memory. Also, Super FAT speeds the search for free space by
representing the FAT's used sectors in an in-memory bitmap. This does
help the performance but I think the results in Figure 4, which were
performed using the Super FAT system, still highlight FAT's
architectural weaknesses.
<P>
You can easily tell whether a partition is formatted under HPFS or FAT. Just
run DIR in the root directory. If "." and ".." directory entries are shown
then HPFS is used [Unless the HPFS partition was formatted under Warp 4 -- Ed].
<P>
<H2>Conclusion</H2>
<P>
HPFS does require 300-400 KB of memory to implement, so it's only
suitable for OS/2 v2.1 systems with at least 12 MB or Warp systems with
at least 8 MB. For partitions of 100 MB+ it offers definite technical
advantages over FAT. By now you should have developed an understanding
of how these improvements are achieved.
<P>
In the next installment, we look at a shareware program to visually
inspect the layout of a HPFS partition and a REXX program to dump the
contents of a disk sector by specifying either decimal LSN, hexadecimal
LSN, dword byte-order-reversed hexadecimal LSN (what you see when you
look at a dword pointer in a hex dump) or Cyl/Hd/Sec coordinates. Other
REXX programs will convert the data stored in the SuperBlock and the
SpareBlock sectors into intelligible values. You should find it quite
informative.

<H1>Inside the High Performance File System</H1>
<H2>Part 3: Fragmentation, Diskspace Bitmaps and Code Pages</H2>
Written by Dan Bridges
<H2>Introduction</H2>
<P>
This article originally appeared in the May 1996 issue of Significant
Bits, the monthly magazine of the Brisbug PC User Group Inc.
<P>
This month we look at how HPFS knows which sectors are occupied and which ones
are free. We examine the amount of file fragmentation on five HPFS volumes and
also check out the fragmentation of free space. A program will be presented to
show free runs and some other details. Finally, we'll briefly discuss Code
Pages and look at a program to display their contents.
<P>
<H2>How Sectors are Mapped on a HPFS Volume</H2>
<P>
The sector usage on a HPFS partition is mapped in data band bitmap blocks.
These blocks are 2 KB in size (four sectors) and are usually situated at either
the beginning or end of a data band. A data band is almost 8 MB. (Actually
8,190 KB since 2 KB is needed for its bitmap.) See Figure 1. The state of each
bit in the block indicates whether or not a sector (HPFS' allocation unit) is
occupied. If a bit is set (1) then its corresponding sector is free. If the
bit is not set (0) then the sector is occupied. Structures situated within the
confines of a data band such as Code Page Info &amp; Data sectors, Hotfix
sectors,
the Root Directory DirBlk etc. are all marked as fully occupied within that
band's usage bitmap.
<P>
<IMG WIDTH=435 HEIGHT=257 SRC="fig1.gif">
<P>
<FONT SIZE=2>
Figure 1: The basic data layout of a HPFS volume.
</FONT>
<P>
Since each bit maps a sector, a byte maps eight sectors and the complete 2 KB
block maps the 16,384 sectors (including the bitmap block itself) in a 8 MB
band. And since two blocks can face each other, we arrive at the maximum
possible extent (fragment) size of 16,380 KB. Examine Figure 2 now to see
examples of file and freespace mapping.
<P>
<IMG WIDTH=429 HEIGHT=302 SRC="fig2.gif">
<P>
<FONT SIZE=2>
Figure 2: The correspondence of the first five bytes in a data band's usage
bitmap to the first 40 sectors in the band.
</FONT>
<P>
In this example we see 23 occupied sectors ("u") and 4 unoccupied areas (".")
which we will refer to as "freeruns" [of sectors]. At one extreme, the 23
sectors might belong to one file (here in four extents) while at the other
extreme we might have the FNODEs of 23 "zero-length" files. (Every file and
directory entry on a HPFS volume must have an FNODE sector.)
<P>
The advantages of the bitmap approach are twofold. First, the small allocation
unit size on a HPFS volume means greatly reduced allocation unit wastage
compared to large FAT partitions. Second, the compact mapping structure makes
it feasible for HPFS to quickly search a data band for enough free space to slot
in a file of known size, in one piece if possible. For example, as just
mentioned HPFS can map 32,760 allocation units with just 4 KB of bitmaps whereas
a 16-bit FAT structure requires 64 KB (per FAT copy) to map 32,768 allocation
units.
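<P>
The bit convention just described (a set bit means free, a clear bit means occupied, with bit 0 of each byte mapping the first sector of its group of eight, as the ShowFreeruns REXX program later in this article assumes) can be decoded in a few lines. The function name is mine, for illustration:

```python
def free_sectors(bitmap_bytes, band_start_lsn=0):
    """Decode a usage bitmap using the convention above: each set bit
    marks a free sector, each clear bit an occupied one.  The least
    significant bit of a byte maps the first sector of its group."""
    free = []
    for byte_index, b in enumerate(bitmap_bytes):
        for bit in range(8):
            if b & (1 << bit):
                free.append(band_start_lsn + byte_index * 8 + bit)
    return free

# 0x00 = eight occupied sectors, 0xFF = eight free sectors,
# 0x01 = only the first sector of that group of eight is free:
print(free_sectors(bytes([0x00, 0xFF, 0x01])))  # [8, 9, ..., 15, 16]
```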
<P>
<H2>A Fragmentation Analysis</H2>
<P>
In this section we'll examine the level of fragmentation on the five HPFS
partitions of my first HD. Look at Figure 3. Notes:
<P>
1. A time-since-last-defrag figure of "Never" means that I've never run a
defragger across this partition since upgrading to OS/2 Warp 118 days ago. This
value is stored in the SuperBlock (LSN 16) and was determined by using the
ShowSuperSpare REXX program featured in Part 2.
<P>
2. The fragmentation levels were reported by the wondrous FST (freeware) with
"FST -n check -f C:" while the names of the fragmented files and their sizes
came from the GammaTech Utilities (commercial) "HPFSOPT C: -u -d -o1 -l
logfile". You can also use the Graham Utilities (commercial) "HPFS-EXT C: -s".
<P>
3. The high number of files with 0 data extents on C: is due to the presence of
the WPS folders on this drive. Each of these has "zero" bytes in the main file
but they usually have bytes in EAs.
<P>
4. Files with 0 or 1 extents are considered to be fully contiguous, so I've placed
them in one grouping.
<P>
5. Files with 2-8 extents are considered to be "nearly contiguous" since
fragments will usually be placed close together on the disk and also because a
list of the location and length of up to 8 extents can be kept in a file's FNODE
sector. This list will be kept memory resident while the file is open. Note 1:
the extents themselves cannot be kept memory resident since, theoretically,
they could be up to 8*16,380 KB in size. But no non-data disk reads, after the
initial read of the FNODE, would be required to work with the file. Note 2:
under some circumstances, the 8 extents, if small enough, could be kept memory
resident in the sense that they could be held in HPFS' cache. We will consider
FNODEs in detail in a later installment.
<P>
6. Files with more than 8 extents have too many fragments to be listed in their
FNODEs. Instead a B+tree allocation sector structure (an ALSEC) is used to map
the extents. The sector mappings are small enough to keep memory resident while
the file is open. ALSECs will be covered in a later installment.
<P>
7. EAs are usually not fragmented since, in the current implementation of OS/2,
the total EA size associated with any one file is only 64 KB. If a file has EAs
in 0 extents then the EA information is stored completely within the FNODE
sector. (There is space in the FNODE for up to 145 bytes of "internal" EAs.)
In all other cases on my system they are currently stored in single, external runs
of sectors. EAs will be covered in later installments.
<P>
<IMG WIDTH=443 HEIGHT=490 SRC="fig3.gif">
<P>
<FONT SIZE=2>
Figure 3: Fragmentation analysis of five HPFS partitions.
</FONT>
<P>
We now turn to the topic of what circumstances are leading to file fragmentation
on these partitions.
<P>
C: - The OS/2 system partition. I've run out of space on this drive on
occasions. Activity here occurs through the running of Fixpacks (FP 16 and then
FP 17 were run), INI maintenance utilities and driver upgrades. There is really
nothing of concern here. Most HPFS defraggers suggest not trying to defrag
files that have less than 2 or 3 extents since you run the risk of fragmenting
the free space. We will return to this topic shortly.
<P>
D: - My main work area and the location of communications files. I use the DOS
comms package TELEMATE because I've always liked its features (although OS/2 has
to work hard to handle its modem access during a file transfer - OS/2 comms
programs, in general, are much less demanding of the CPU's attention). The
other major comms package I use is OS/2 BinkleyTerm v2.60 feeding OS/2 Squish
message databases. The fragmented files consist mainly of files downloaded by
TELEMATE (DOS comms programs do not inform HPFS, ahead of time, of how much
space the downloaded file will occupy) and Squish databases (*.SQD). The drive
was defragged 53 days ago at which time no special effort was made to reduce
file fragmentation below 2-3 extents, accounting for the presence of 245 files
with two extents. This really is an insignificant amount regardless of what the
4% figure may lead you to believe.
<P>
The most fragmented file on this partition is a 150 KB BinkleyTerm logfile with
30 extents. The main reason I can see for fragmentation in this case is that
the file is frequently being updated with information while file transfers are
in progress. The Squish databases are also prone to fragmentation. Out of a
total of 25 database files there were 8, averaging 500 KB each, with an average
of 15 extents.
<P>
E: - The fragmentation here was insignificant apart from a single 2.8 MB
executable Windows program that has had a DOS patch program run over it,
resulting in 38 fragments. The 2-extent files were mainly data files that are
produced by this same Windows package (being run under WIN-OS2).
<P>
F: - Almost no fragmentation since this partition is reserved for DOS programs
and I don't use them much.
<P>
G: - My second major work partition. Fragmentation is low and unlikely to go
much lower since 2 extents is considered below the point of defragger
involvement.
<P>
The conclusion to be drawn from the above is that, if you don't get too hot
under the collar about some files having 2 or 3 extents then there will
generally be little need to worry about fragmentation under HPFS. Only certain
types of files (some comms/DOS/Windows) will be candidates. And keeping
partitions less than 80% full should help reduce general fragmentation as well.
<P>
<H2>Defragmenting Files</H2>
<P>
Since fragmentation is a relatively minor concern under HPFS there is not much
of an argument for purchasing OS/2 utilities based mainly on their ability to
defragment HPFS drives, especially since it's not hard to defragment files
yourself. You see, providing there is enough contiguous freespace on a volume,
the mere act of copying the files to a temporary directory, deleting the
original and then moving the files back will usually eliminate, or at least
reduce fragmentation since HPFS, knowing the original filesize, will look for a
suitably sized freespace. The success of this technique is demonstrated in
Figure 4 where 25 Squish database files (*.SQD) totalling 5.7 MB were shuffled
about on D:. Note: don't use the MOVE command to initially transfer the files
to the temp directory since this will just alter the directory entry rather than
actually rewriting the files.
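<P>
The copy-out/copy-back procedure can be sketched as a short script. This is only an illustration of the rewrite idea: shutil.copy2 preserves timestamps but does not copy OS/2 EAs, and the function name and paths are invented for the example:

```python
import os
import shutil

def rewrite_files(paths, temp_dir):
    """Sketch of the manual defrag above: copy each file out (a full
    rewrite, so the filesystem knows the final size up front), delete
    the fragmented original, then copy it back."""
    os.makedirs(temp_dir, exist_ok=True)
    for path in paths:
        staged = os.path.join(temp_dir, os.path.basename(path))
        shutil.copy2(path, staged)   # copy out; final size known in advance
        os.remove(path)              # release the fragmented extents
        shutil.copy2(staged, path)   # rewritten contiguously if space allows
        os.remove(staged)            # clean up the staging copy
```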
<P>
<IMG WIDTH=159 HEIGHT=232 SRC="fig4.gif">
<P>
<FONT SIZE=2>
Figure 4: Number of extents in 25 SQD files before and after the defrag process
described in the text.
</FONT>
<P>
I've used the GU's HPFS-EXT to report these figures. This is freely available
in the GULITE demo package. Note: the fully functional HPFSDFRG is also in
this package but I wanted to show that it's not that hard to do this by hand.
HPFSDFRG does much the same as I did except that you can specify the
optimisation threshold (minimum number of extents before a file becomes a
candidate) and it will retry the copying operation up to ten times if there are
more extents after the operation than before it (due to heavily fragmented
freespace).
<P>
<H2>The Fragmentation of Freespace</H2>
<P>
Another significant aspect of HPFS' fragmentation resistance is how well the FS
keeps disk freespace in big, contiguous chunks. If the current files on a
partition are relatively fragmentation free but the remaining freespace is
arranged in lots of small chunks then there is a good chance that new files will
be fragmented. You can check this with "FST -n info -f C:". This produces a
table that counts the number of freespace extents that are 1, 2-3, 4-7, 8-15,
... 16384-32767 sectors long. In my opinion, though, it is more important to
consider the product of the actual extent size and its frequency, since the
presence of numerous 1-sector free spaces is not important if a number of
large spaces are still available.
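<P>
That size-times-frequency view can be computed with a small sketch: group freerun lengths into the same power-of-two classes FST reports and show, per class, both the run count and the total sectors. The function is illustrative:

```python
def freespace_profile(run_lengths):
    """Group freerun lengths (in sectors) into power-of-two classes
    (1, 2-3, 4-7, ...) and return, per class, the run count and the
    total sectors -- the size-times-frequency product."""
    profile = {}
    for length in run_lengths:
        low = 1 << (length.bit_length() - 1)       # class lower bound
        cls = "1" if low == 1 else f"{low}-{2 * low - 1}"
        count, total = profile.get(cls, (0, 0))
        profile[cls] = (count + 1, total + length)
    return profile

# Three 1-sector scraps barely matter beside one 5000-sector run:
print(freespace_profile([1, 1, 1, 6, 5000]))
```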
<P>
Figure 5 shows the output of the REXX program ShowFreeruns.cmd. The partition
of 100 MB is almost empty. The display shows the location of the 2 KB block
that holds the list of the starting LSNs of each bitmap block (this figure comes
from the dword at offset 18h in the SuperBlock), the location of each bitmap
block on the left and the sector size and location of freespace on the right.
As you see, this partition has 13 data bands, 6 of which face each other. A
version of ShowFreeruns.cmd that only outputs the run size was used to generate
a list of figures. This list was loaded into a spreadsheet, sorted and a
frequency distribution performed. See Figure 6. You can see that C: has no
large areas remaining, D: has the majority of its freespace in the 4 MB &lt; 8 MB
range and that E:, F: and G: have kept large majorities of their freespace in
very big runs. Overall, this is quite good performance.
<PRE>
Inspecting drive O:
List of Bmp Sectors: 0x00018FF0 (102384)
Space-Usage Bitmap Blocks:
Freespace Runs:
0x00000014-00000017 (20-23)
0x00007FFC-00007FFF (32764-32767)
130-32763 (#1:32634)
0x00008000-00008003 (32768-32771)
0x0000FFFC-0000FFFF (65532-65535)
32772-65531 (#2:32760)
0x00010000-00010003 (65536-65539)
0x00017FFC-00017FFF (98300-98303)
65540-81919 (#3:16380)
81926-98291 (#4:16366)
0x00018000-00018003 (98304-98307)
0x0001FFFC-0001FFFF (131068-131071)
100369-102383 (#5:2015)
102400-131067 (#6:28668)
0x00020000-00020003 (131072-131075)
0x00027FFC-00027FFF (163836-163839)
131076-163835 (#7:32760)
0x00028000-00028003 (163840-163843)
0x0002FFFC-0002FFFF (196604-196607)
163844-196603 (#8:32760)
0x00030000-00030003 (196608-196611)
196612-204767 (#9:8156)
</PRE>
<FONT SIZE=2>
Figure 5: Output from the ShowFreeruns.cmd REXX program.
</FONT>
<P>
<IMG WIDTH=429 HEIGHT=378 SRC="fig6_3.gif">
<P>
<FONT SIZE=2>
Figure 6: Freespace analysis on five HPFS partitions.
</FONT>
<P>
<H2>The ShowFreeruns Program</H2>
<P>
Like other programs in this series, ShowFreeruns.cmd (see Figure 7) uses
SECTOR.DLL to read a sector off a logical drive. I was motivated to design this
program after seeing the output of the GU's "HPFSINFO C: -F". On a one-third
full 1.2 GB partition, the program presented here takes 17 secs compared to
HPFSINFO's time of 26 secs. HPFSINFO also shows the CHS (Cyl/Hd/Sec)
coordinates of each run. I was not interested in these but instead display the
freerun's size. HPFSINFO also displays the meaning of what's in the SuperBlock
and the SpareBlock. If you want to do this, you can include the code from
ShowSuperSpare.cmd from Part 2 and it will only add an extra 0.5 secs to the
time. The performance then, for an interpreted program (REXX), is quite good and
was achieved primarily through a speed-up technique to be discussed shortly.
Moreover, HPFSINFO consistently overstates the end of each freerun by 1 and it
sometimes does not show the last run (e.g. on C: it states that there are 366
freeruns but only shows 365 of them). This last bug appears to be caused by the
last freerun continuing to the end of the partition. My design accounts for
this situation.
<PRE>
/* Shows bitmap locations and free space runs */
ARG drive . /* First parm should always be drive */
IF drive = '' THEN CALL HELP
parmList = "? /? /H HELP A: B:"
IF WordPos(drive, parmList) \= 0 THEN CALL Help
/* Register external DLL functions */
CALL RxFuncAdd 'ReadSect','Sector','ReadSect'
CALL RxFuncAdd 'RxDate','RexxDate','RxDate'
/* Initialise Lookup Table*/
DO exponent = 0 TO 7
bitValue.exponent = D2C(2**exponent)
END exponent
secString = ReadSect(drive, 16) /*Read Superblk sec*/
freespaceBmpList = C2D(Reverse(Substr(secString,25,4)))
totalsecs = C2D(Reverse(Substr(secString,17,4)))
'@cls'
SAY
SAY "Inspecting drive" drive
SAY
/* Offset 25 (0x18) in SuperBlock = list of bitmap blocks */
CALL ShowDword " List of Bitmap secs",25
startOfListBlk = 0
startOfBlk = 0
bmpListBlk = ""
bmpBlk = ""
getFacingBands = 0
runNumber = 0
byteOffset = 0
/* Read in 4 secs of the list of sec-usage bmp blks */
DO secWithinBlk = freespaceBmpList TO freespaceBmpList+3
temp = StartOfListBlk + secWithinBlk
bmpListBlk = bmpListBlk||ReadSect(drive, temp)
END secWithinBlk
SAY
SAY "Space-Usage Bitmap Blocks:"
SAY " Freespace Runs:"
/* Use dword pointers to bmps to read in 2KB bmp blks */
DO listOffset = 1 TO 2048 BY 4
startDecStr = C2D(Reverse(Substr(bmpListBlk,ListOffset,4)))
IF startDecStr = 0 THEN /* No more bmps listed */
DO
IF getFacingBands = 1 THEN
DO /* Last data band had no facing data band */
bmpSize = 2048
CALL DetermineFreeruns
LEAVE
END
LEAVE
END
/*Display a blank line when a new facing band occurs*/
IF (listOffset+7)//8 = 0 THEN SAY
CALL ShowBmpBlk listOffset
DO secWithinBlk = 0 TO 3
temp = StartOfBlk + secWithinBlk
bmpBlk = bmpBlk||ReadSect(drive, temp)
END secWithinBlk
getFacingBands = getFacingBands + 1
IF getFacingBands = 2 THEN /* Wait until you get both */
DO /* bmps for the facing data*/
bmpSize = 4096 /* bands since maximum extent*/
CALL DetermineFreeruns /* length is 16,380 KB */
byteOffset = byteOffset+4096
getFacingBands = 0
bmpBlk = ""
END
END listOffset
EXIT /**************EXECUTION ENDS HERE**************/
FourBytes2Hex: /* Given offset, return dword */
ARG startPos
rearranged = Reverse(Substr(secString,startPos,4))
RETURN C2X(rearranged)
ShowDword: /* Display dword and dec equivalent */
PARSE ARG label, offset
hexStr = FourBytes2Hex(offset)
SAY label": 0x"hexStr "("X2D(hexStr)")"
RETURN
ShowBmpBlk:
/* Show start-end of freespace runs in hex &amp; dec */
PARSE ARG offset
endDecStr = C2D(Reverse(Substr(bmpListBlk,offset,4)))+3
SAY " 0x"D2X(startDecStr,8)"-"D2X(endDecStr,8)" ("startDecStr"-"endDecStr")"
startOfBlk = startDecStr
RETURN
DetermineFreeruns:
runStatus = 0
oldchar = ''
/* Check 128 secs at a time to speed up operation */
DO para = 1 to bmpSize BY 16
/* 16 bytes*8 secs/byte = 128 secs per para scanned */
char = Substr(bmpBlk,para,16)
IF char = 'FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF'x &amp;,
runstatus = 1 THEN ITERATE para
IF char = '00000000000000000000000000000000'x &amp;,
runstatus = 0 THEN ITERATE para
/* Part of paragraph has run start/end
so check a byte (8 secs) at a time. */
DO byte = para TO para + 15
char = Substr(bmpBlk,byte,1)
IF char &gt; '0'x THEN /* 1 or more free secs */
DO
IF char = 'FF'x THEN /* 8 unoccupied secs */
IF runStatus = 1 THEN /* Run is in progress */
NOP
ELSE /* Run starts on 8 sec boundary */
DO
startByte = byte + byteOffset
startBitPos = 0
runStatus = 1 /* Start run determination */
END
ELSE
CALL DetermineBit /* Partial usage of 8 secs */
END
ELSE
DO /* All 8 secs are used */
IF runStatus = 1 THEN
DO
endByte = byte + byteOffset
endBitPos = -1 /* Run ends with prior sec */
CALL ShowRun
END
END
END byte
END para
IF runStatus = 1 THEN /* Freespace at end of part. */
DO
endByte = 9999999999 /* Larger than # of secs in */
endBitPos = 0 /* max. possible part.(512GB) */
CALL ShowRun /* so ShowRun will set runEnd */
/* to last LSN in this part. */
END
RETURN
DetermineBit: /* Free/occupied usage within 8 sec blk */
DO bitPos = 0 TO 7
IF runStatus = 0 THEN
DO /* No run currently in progress */
IF BitAnd(char, bitValue.bitPos) &gt; '0'x THEN
DO /* sec is free */
startByte = byte + byteOffset
startBitPos = bitPos
runStatus = 1
END
END
ELSE
DO
IF BitAnd(char, bitValue.bitPos) = '0'x THEN
DO /* sec is used */
endByte = byte + byteOffset
/* When a run ends, the sec before the first
used one is the last sec in the freerun. */
endBitPos = bitPos - 1
CALL ShowRun
END
END
END bitPos
RETURN
ShowRun:
/* Display freerun start-end secs &amp; reset run status */
runNumber = runNumber + 1
runStart = (startByte - 1) * 8 + startBitPos
runEnd = (endByte - 1) * 8 + endBitPos
IF runEnd &gt; totalSecs THEN runEnd = TotalSecs - 1
IF runStart \= runEnd THEN /* More than 1 sec is free */
DO
run = runStart"-"runEnd
run = Left(run||Copies(" ",14),15)
SAY Copies(" ",40) run "(#"runNumber":"runEnd-RunStart+1")"
END
ELSE
DO
run = Left(runStart||Copies(" ",14),15)
SAY Copies(" ",40) run "(#"runNumber":1)"
END
runStatus = 0
RETURN
Help:
SAY
SAY "Purpose:"
SAY " ShowFreeruns displays the location of the sec-usage bitmap blocks"
SAY " and the location and extent of free space runs."
SAY
SAY "Example:"
SAY " ShowFreeruns C:"
SAY
EXIT
</PRE>
<FONT SIZE=2>
Figure 7: The ShowFreeruns.cmd REXX program. Requires SECTOR.DLL.
</FONT>
</FONT>
<P>
Since a sector is mapped by a bit, the program often needs to check the status
of a bit within a bitmap's byte. This is done using the BITAND(string1,
string2) inbuilt function. In this design string 1 holds the byte to be
examined and string 2 holds a character that only has the corresponding bit set.
Rather than having to work out the character for string 2 each time BITAND() is
used, we instead precalculate the eight characters and then store them in the
BitValue. compound variable for later use.
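<P>
The same lookup-table trick looks like this in Python, as an illustrative parallel to the REXX BitValue. stem rather than a translation of the whole program:

```python
# Equivalent of the REXX BitValue. stem: precompute the eight one-bit
# mask characters once rather than evaluating 2**bitPos on every test.
BIT_MASK = [bytes([1 << n]) for n in range(8)]

def bit_is_set(byte_char, bit_pos):
    """REXX-style BITAND test on a 1-byte string: AND the byte with a
    precomputed single-bit mask and check for a nonzero result."""
    masked = bytes(a & b for a, b in zip(byte_char, BIT_MASK[bit_pos]))
    return masked != b"\x00"

print(bit_is_set(b"\x05", 0))  # True  (0x05 has bits 0 and 2 set)
print(bit_is_set(b"\x05", 1))  # False
```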
<P>
The next step is to read in the SuperBlock and from it get the location of the
list of bitmap sectors and the total number of sectors. The latter value is
required so we know when we've reached the end of the partition.
<P>
We then read in the four sectors of the block holding the list of bitmaps. The
list consists of dwords that store the starting LSN of each bitmap block. 128
dwords can fit in each sector of the list so the four sectors of the list can
hold 512 bitmap block LSNs. Now a bitmap block maps 8 MB of diskspace so this
'lite' version is only good when dealing with a partition of less than 4 GB.
(Earlier works refer to the maximum partition size as 512 GB but in the recent
"Just Add OS/2 Warp" package, in its technical section, it is stated that the
maximum partition size is 64 GB.) I won't be able to check this aspect of the
design until I get a HD bigger than 4 GB and succumb to the mad urge to
partition it as one volume.
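<P>
The 4 GB ceiling of this 'lite' design follows directly from the figures just given:

```python
# Capacity of the 4-sector bitmap list, from the figures in the text:
SECTOR_BYTES = 512
DWORD_BYTES = 4
LIST_SECTORS = 4                 # the bitmap list occupies one 2 KB block
BAND_MB = 8                      # each bitmap block maps (almost) 8 MB

dwords_per_sector = SECTOR_BYTES // DWORD_BYTES    # 128 LSNs per sector
max_bitmaps = dwords_per_sector * LIST_SECTORS     # 512 bitmap blocks
print(max_bitmaps * BAND_MB, "MB")                 # 4096 MB, i.e. 4 GB
```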
<P>
The end of the list is indicated by the first occurrence of 0000h. The list of
the 100 MB partition shown in Figure 5 contains only 13 dwords since it has 13
data bands so, in a typical case, you should not expect to find much data stored
in this block.
<P>
A freerun can be bigger than a data band since pairs of bands face each other,
so we consider two bands at a time, unless we reach the end of the partition
without a facing band. Once we have a data region we call the DetermineFreeruns
procedure. Here we examine the two, combined data bitmaps (unless it's a solo
band at the end). In the initial design I looked at each byte in the 4 KB
bitmap combination to see if it was either 00h (all eight sectors used) or
FFh (all eight sectors free). Typically, you will find lots of occupied or free
sectors together, so checking eight at a time speeds up the search. Only when
the byte was neither of these is a bit-level search required.
<P>
However, the speed of this version was poor, with the search through each byte of
the 322 KB of bitmaps for the 161 databands in the 1.2 GB partition taking a
total of 104 secs. The obvious solution was to extend the optimisation method
to a second, higher level by checking more bytes first to see if they were all
set or clear. I settled on 16 bytes which covers 128 sectors (64 KB) of
diskspace at a time and this resulted in the final time of 17 secs. Further
experiments with larger (64 byte) groups and also with third-level optimisation
did not show much improvement with my mix of partitions but your situation may
warrant further experimentation.
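<P>
The two-level scan can be sketched as follows; this is an illustrative simplification that just counts free sectors rather than reporting runs:

```python
def count_free(bitmap, chunk=16):
    """Two-level scan in the spirit of the optimisation above: dispose
    of chunk-sized runs of all-free (0xFF) or all-used (0x00) bytes in a
    single comparison, and fall back to bit counting only for mixed
    chunks.  With chunk=16, one comparison covers 128 sectors."""
    all_free = b"\xff" * chunk
    all_used = b"\x00" * chunk
    free = 0
    for i in range(0, len(bitmap), chunk):
        piece = bitmap[i:i + chunk]
        if piece == all_free:
            free += 8 * chunk                 # 128 free sectors, one test
        elif piece == all_used:
            pass                              # 128 used sectors, one test
        else:                                 # mixed: count set bits
            free += sum(bin(b).count("1") for b in piece)
    return free

bmp = b"\xff" * 16 + b"\x00" * 16 + bytes([0x0F]) + b"\x00" * 15
print(count_free(bmp))  # 128 + 0 + 4 = 132
```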
<P>
<H2>Code Pages</H2>
<P>
Different languages have different character sets. Code Pages (CPs) are used to
map an ASCII character to the actual character. CP tables reside in
COUNTRY.SYS. They are also present on a HPFS volume and every directory entry
(DIRENT) includes a CP index value.
<P>
CPs are used to map character case (i.e. in a foreign character set the
relationship between lower and upper-case characters) and for collating
sequences used when sorting. As mentioned in Part 1, HPFS directories use a
B-tree structure which, as part of its operation, always store file/directory
names in sorted order. Remember that HPFS is not case-sensitive (including when
sorting) but it preserves case.
<P>
The European-style languages (including English) have relatively straightforward
Single-Byte Character Sets (SBCS) i.e. one character is represented by one
byte. Asian character sets typically have many characters so they require two
bytes per character (DBCS).
<P>
The first 128 characters in all ASCII CPs are the same so the CP tables on the
disk only map ASCII 128-255.
<P>
The SpareBlock holds the LSN of the first CP Info sector. There is a header
followed by up to 31 16-byte CP Info Entries. There is provision for more than
one CP Info sector which could hold CP Info Entries 31-61 (counting from 0).
Why so many different CPs are catered for I have no idea since I've been unable
to have more than two loaded at a time. In Australia we typically use CP437
(standard PC) - Country 061 and CP850 (multilingual Latin-1) - Country 000. The
layout of a CP Info sector is shown in Figure 8.
<P>
<IMG WIDTH=431 HEIGHT=400 SRC="fig8.gif">
<P>
<FONT SIZE=2>
Figure 8: The layout of a Code Page Information Sector.
</FONT>
<P>
The CP Info Entry contains the LSN where this entry's CP mapping table is
stored. This sector is a CP Data Sector. As well as a header there is enough
space for up to three 128-byte CP maps per sector. Figure 9 shows the layout of
a CP Data Sector.
<P>
<IMG WIDTH=431 HEIGHT=450 SRC="fig9.gif">
<P>
<FONT SIZE=2>
Figure 9: The layout of a Code Page Data Sector.
</FONT>
<P>
<H2>The CP.cmd Program</H2>
<P>
Figure 10 shows the display produced by the REXX CP.cmd program (Figure 11).
I've stopped it before it reached ASCII 255. Normally, the output will scroll
off the screen, so either pause it or send it to the printer. If the mapped
character has the same value as its ASCII value the word "same" is displayed
instead to reduce clutter.
<P>
<IMG WIDTH=430 HEIGHT=320 SRC="fig10.gif">
<P>
<FONT SIZE=2>
Figure 10: Partial output from the CP.cmd program. List continues on to ASCII
255.
</FONT>
<PRE>
/* Decodes CP info &amp; CP data sectors on a HPFS volume */
ARG drive . /* First parm should always be drive */
IF drive = '' | drive = "?" | drive = "HELP",
| drive = "A:" | drive = "B:" THEN CALL Help
CALL RxFuncAdd 'ReadSect','Sector','ReadSect' /* In SECTOR.DLL */
secString = ReadSect(drive,17) /* SpareBlock is LSN 17 */
'@cls'
SAY
SAY "Inspecting drive" drive
SAY
/* Offset 33 in Spareblock contains dword of CP info LSN */
cpInfoSec = C2D(Reverse(Substr(secString,33,4)))
secString = ReadSect(drive,cpInfoSec) /* Load CP info sec */
numOfCodePages = C2D(Reverse(Substr(secString,5,2)))
prevDataSec = ''
SAY "CODE PAGE INFORMATION (sector" cpInfoSec"):"
SAY "Signature Dword: 0x"FourChars2Hex(1)
SAY " CP# Ctry Code Code Page CP Data Sec Offset"
DO x = 0 TO numOfCodePages-1
hexCountry = TwoChars2Hex((16*x)+17)
decCountry = Right('00'X2D(hexCountry),3)
cp = TwoChars2Hex((16*x)+19)
country.x = X2D(cp)
hexSec = FourChars2Hex((16*x)+25)
decSec = X2D(hexSec)
cpDataSec = decSec
/* Since up to 3 CP tables can fit in 1 CP data sec,
only read in a new data sec when the need arises. */
IF cpDataSec \= prevDataSec THEN
DO
dataSecString = ReadSect(drive,cpDataSec)
prevDataSec = cpDataSec
END
offset = C2D(Reverse(Substr(dataSecString,(2*x)+21,2)))
start = offset + 1
SAY " " x " 0x"hexCountry "("decCountry") 0x"cp "("X2D(cp)") 0x"hexSec "("decSec") 0x"D2X(offset) "("offset")"
/* Store table contents of each CP in an array */
DO y = 128 TO 255
char = Substr(dataSecString,start+6+y-18,1)
IF C2D(char) \= y THEN
array.x.y = Format(C2D(char),4) "("char")"
ELSE
array.x.y = " same "
END y
END x
/* Work out title line based on number of CPs */
titleLine = " ASCII "
DO x = 0 TO numOfCodePages-1
titleLine = titleLine " CP" country.x
END x
SAY
SAY titleLine
/* Display each table entry based on number of CPs */
DO y = 128 TO 255
dispLine = ''
DO x = 0 TO numOfCodePages-1
dispLine = dispLine" "array.x.y
END x
SAY "" y "("D2C(y)"):" dispLine
END y
EXIT /****************EXECUTION ENDS HERE****************/
FourChars2Hex:
ARG offset
RETURN C2X(Reverse(Substr(secString,offset,4)))
TwoChars2Hex:
ARG offset
RETURN C2X(Reverse(Substr(secString,offset,2)))
Help:
SAY "Purpose:"
SAY " CP decodes the CodePage Directory sector &amp;"
SAY " the CodePage sector on a HPFS volume"
SAY
SAY "Example:"
SAY " CP C:"
EXIT
</PRE>
<FONT SIZE=2>
Figure 11: The CP.cmd REXX program. Requires SECTOR.DLL.
</FONT>
<P>
While REXX does not support true arrays, it does have compound variables, and
I've used a compound variable called "array" to store the contents of each CP's
mapping table. The
design only deals with the first 31 CP Info entries (that should be more than
enough anyway) and accommodates additional CPs by adding new columns to the
display.
<P>
Armed with this printout you can experiment with different collating sequences
when switching CPs. You can check out your current CP by typing "CHCP" and then
switch to a different CP by issuing, say, "CHCP 850". I used "REM &gt;
File[Alt-nnn]" to create zero-length files, with one or more high-order ASCII
characters in their filenames, as test fodder.
<P>
<H2>Conclusion</H2>
<P>
In this installment you've learned how to decode the data band usage bitmaps
contents and how to display the contents of the Code Page mapping tables. Next
time we'll examine B-trees, DIRBLKs and DIRENTs.

<H1>Inside the High Performance File System</H1>
<H2>Part 5: FNODEs, ALSECs and B+trees</H2>
Written by Dan Bridges
<H2>Introduction</H2>
This article originally appeared in the August 1996 issue of Significant
Bits, the monthly magazine of the Brisbug PC User Group Inc.
<P>Last month you saw how DIRENTs (directory entries) are stored in
4-sector structures known as DIRBLKs. These blocks have limited space
available for entries. Due to the variable length of filenames (1-254
characters), the maximum number of entries depends on the average filename
length. If the average name length is in the 10-13 character range, a
DIRBLK can hold up to 44 entries.
<P>When there are more files in a directory than can fit in a single
DIRBLK, other DIRBLKs will be used and the connection between these blocks
forms a structure known as a B-tree. Since there can be many elements
(entries) in a node (DIRBLK), a HPFS B-tree has a quick "fan-out" and a
low height (number of levels), ensuring fast entry location.
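<P>A quick way to see the effect of fan-out on height is the idealised sketch below (my own illustration, assuming fully packed nodes, which real HPFS trees will not be):

```python
def btree_height(entries: int, fanout: int) -> int:
    """Levels needed to hold `entries` items when every node is packed
    with `fanout` of them (an idealised best case)."""
    levels, capacity = 1, fanout
    while capacity < entries:
        levels += 1
        capacity *= fanout
    return levels

# With up to 44 DIRENTs per DIRBLK, even large directories stay shallow.
assert btree_height(44, 44) == 1
assert btree_height(44 * 44, 44) == 2    # 1,936 entries in two levels
```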
<P>This time, we'll take a long look at how a file's contents are
logically stored under HPFS. To the best of my knowledge, this topic has
not been well-covered in the scanty information available about HPFS. You
will find it helpful to contrast the following file-sector allocation
methods with last month's directory entry concepts.
<H2>Fragging a File</H2>
Since HPFS is inherently fragmentation-resistant, we have to twist its arm
a little to produce fragmented files. The method I came up with first
fills up an empty partition with a number of files created in an ascending
name sequence. The next step deletes every second file. Finally, I create
a file that is approximately one-half the partition's size. This file then
has nowhere to go except into all the discontiguous regions previously
occupied by the deleted files.
<P>This process takes some time with a large partition (100 MB) so I
suggest you use a very small partition (1 MB). At first glance, you may
think that if we fill up a 1 MB partition with say 100 files, then delete
File1, File3, ... File99, and then create a 512K file, we will end up with
a file with exactly 50 extents (fragments). This is not so, since each
individual file occupies an FNODE sector as well as the sectors for the
file itself, whereas a single fragmented file still has only one FNODE.
There is therefore slightly more space in each gap for an extent than
there was for a file: a 512K file finds more than 512K of free space,
occupies fewer gaps than expected, and ends up with fewer extents than
specified. For example, in the 50-gap, 1 MB partition scenario we end up
with 45 extents. There are also variations produced by the centrally
located DIRBAND, the separate Root DIRBLK and the multiple data bands, all
of which "fragment" the available free space for very large files. So the
number of gaps produced by deleting alternate files is only a rough
approximation of the number of extents that will be produced.
<P>Figure 1 shows the MakeExtents.cmd REXX program. You specify the number
of gaps you want to produce. For example, to originally produce 100 files
on N:, delete half of them and leave 50 gaps, you would issue the command
"MakeExtents N: 50".
<PRE>
/* Produces a large, fragmented file */
PARSE ARG numOfExts
CALL RxFuncAdd 'SysLoadFuncs', 'RexxUtil', 'SysLoadFuncs'
CALL SysLoadFuncs /* Load REXXUTIL.DLL external funcs */
CALL SysCls
EXIT /* Safety line. Delete this when you've adjusted the
drive to suit your system. Formats the drive. */
'echo y | format n: /l /fs:hpfs'
SAY
CALL SysMkDir 'n:\test' /* REXX MD. Faster than OS/2 MD */
currentDir = Directory() /* Store current drive/directory */
CALL Directory 'n:\test' /* Change to test drive/directory*/
/* Determine free space */
PARSE VALUE SysDriveInfo('n:') WITH . free .
/* Determine size of each sequential file */
fileSize = (free - (numOfExts*2*512)) % (numOfExts*2)
secsInFile = fileSize % 512
sectorFill = Copies('x',512) /* 512 bytes of 'x' char */
Fill_20K = Copies(sectorFill,40) /* 20,480 bytes of 'x' */
/* Create string of the required length */
CALL MakeFile secsInFile
DO i = 1 TO numOfExts*2 /* Produce the file sequence */
CALL CreateFile /* Fixed-length filenames: File00001 */
END i
DO i = 1 TO numOfExts*2 BY 2 /* Delete alternate files */
CALL SysFileDelete 'n:\test\file'||Right("0000"||i,5)
END i
PARSE VALUE SysDriveInfo('n:') WITH . free .
fragmentedFileSecs = ((free-512) % 512)-1
CALL MakeFile fragmentedFileSecs
i='FRAGG' /* Fragmented filename: FileFRAGG */
CALL CreateFile /* Create "FileFRAGG" */
CALL Directory currentDir /* Return to original location */
EXIT /********************************************/
MakeFile: PROCEDURE EXPOSE file sectorFill fill_20K
ARG secs
file = ''
/* If final file is over 20K, speed up creation a little */
IF secs&gt;40 THEN
file = Copies(fill_20K, secs%40)
file = file||Copies(sectorFill, secs//40)
RETURN file
CreateFile:
CALL Charout 'n:\test\file'||Right("0000"||i,5),file,1
CALL Stream 'n:\test\file'||Right("0000"||i,5),'C','CLOSE'
RETURN
</PRE>
<FONT SIZE=2>
Figure 1: The MakeExtents.cmd program produces a fragmented file. When set up
correctly, this program will wipe a partition.
</FONT>
<H2>FNODEs, ALSECs, ALLEAFs and ALNODEs</H2>
Every file and directory on a HPFS partition has an associated FNODE,
usually situated in the sector just before the file's first sector. The
role of an FNODE is quite specific: to map the location of the file's
extents (fragments) and any associated components, namely EAs (Extended
Attributes - up to 64K of ancillary information) and ACLs (Access Control
Lists - to do with LAN Manager).
<P>FNODEs and ALSECs (to be discussed shortly) contain a list of either
ALLEAF or ALNODE entries. See Figure 2. An ALLEAF entry contains three
dwords: logical sector offset (where the start of this run of sectors is
within the total number of sectors in the file - the logical start sector
is 0); run size in sectors; physical LSN (where the run starts in the
partition). An ALLEAF entry is at the end of the B+tree. An ALNODE entry
is an intermediate component in that it does not contain any extent
information. Rather, it points to an ALSEC, and in turn the ALSEC can
contain a list of either ALLEAFs (the end of the line) or ALNODEs (another
descendant level in the B+tree).
<PRE>
Offset Data Size Comment
hex (dec) bytes
Header
00h (1) Signature 4 0xF7E40AAE
04h (5) Seq. Read History 4 Not implemented.
08h (9) Fast Read History 4 Not Implemented.
0Ch (13) Name Length 1 0-254.
0Dh (14) Name 15 Last 15 chars. (Full name in DIRBLK.)
1Ch (29) Container Dir LSN 4 FNODE of Dir that contains this one.
20h (33) ACL Ext. Run Size 4 Secs in external ACL, if present.
24h (37) ACL LSN 4 Location of external ACL run.
28h (41) ACL Int. Size 2 Bytes in internal (inside FNODE) ACL.
2Ah (43) ACL ALSEC Flag 1 &gt;0 if ACL LSN points to an ALSEC.
2Bh (44) History Bits Count 1 Not implemented.
2Ch (45) EA Ext. Run Size 4
30h (49) EA LSN 4
34h (53) EA Int. Size 2
36h (55) EA ALSEC Flag 1 &gt;0 if EA LSN points to an ALSEC.
37h (56) Dir Flag 1 Bit0 = 1 if dir FNODE, else file FNODE.
38h (57) B+Tree Info Flag 1 0x20 (bit 5) Parent is an FNODE, else ALSEC.
0x80 (bit 7) ALNODEs follow, else ALLEAFs.
39h (58) Padding 3 Reestablish 32-bit alignment.
3Ch (61) Free Entries 1 Number of free array entries.
3Dh (62) Used Entries 1 Number of used array entries.
3Eh (63) Free Ent. Offset 2 Offset to next free entry in array.
If ALLEAFs (Maximum of 8 in an FNODE)
Extent #0
40h (65) Logical LSN 4 Sec offset of this extent within file.
The first extent has an offset of 0.
44h (69) Run Size 4 Number of sectors in this extent.
48h (73) Physical LSN 4 File: LSN of extent start.
Dir: This B-tree's topmost DIRBLK LSN.
...
Extent #7
94h (149) Logical LSN 4
98h (153) Run Size 4
9Ch (157) Physical LSN 4
If ALNODEs (Maximum of 12 in an FNODE)
Extent #0
40h (65) End Sector Count 4 Running total of secs mapped by this
alnode. 1-based. If EOF is within this
alnode then field contains 0xFFFFFFFF.
44h (69) Physical LSN 4 File: LSN of ALSEC.
Dir: This B-tree's topmost DIRBLK LSN.
...
Extent #11
98h (153) End Sector Count 4
9Ch (157) Physical LSN 4
Tail
A0h (161) Valid File Length 4 Should be the same as File Size in DIRENT.
A4h (165) "Needed" EAs Count 4 If any, EAs vital to the file's wellbeing.
A8h (169) User ID 16 Not used.
B8h (185) ACL/EA Offset 2 Offset in FNODE to first ACL, if present,
otherwise offset to where EAs would be
stored, if internalised.
BAh (187) Spare 10 Unused.
C4h (197) ACL/EA Storage 316 Only 145 bytes appear available for EAs.
</PRE>
<FONT SIZE=2>
Figure 2: Layout of an FNODE. This component can contain either an array
of ALNODE or ALLEAF entries.
</FONT>
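<P>As a sketch (not part of the article), the two entry shapes can be decoded with Python's struct module; the dictionary keys are my own names, but the entry sizes and little-endian dword order follow the layouts described above. The sample values are the first extent reported in Figure 6.

```python
import struct

def parse_alleaf(raw: bytes) -> dict:
    """Decode one 12-byte ALLEAF entry: three little-endian dwords."""
    logical, run_size, physical = struct.unpack_from('<III', raw)
    return {'logical_offset': logical, 'run_size': run_size,
            'physical_lsn': physical}

def parse_alnode(raw: bytes) -> dict:
    """Decode one 8-byte ALNODE entry: two little-endian dwords."""
    end_count, physical = struct.unpack_from('<II', raw)
    return {'end_sector_count': end_count, 'alsec_lsn': physical}

# Extent #0 from Figure 6: 115 sectors at LSN 317, logical offset 0.
leaf = parse_alleaf(struct.pack('<III', 0, 115, 317))
assert leaf == {'logical_offset': 0, 'run_size': 115, 'physical_lsn': 317}
```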
<P>Returning to the B-tree structure of DIRBLKs, you will remember that
both intermediate and leaf components contain DIRENT data. So you may find
the entry you're looking for in a node. This is not the case with a
B+tree. Since an ALNODE can only point to an ALSEC, you must always
proceed to the bottom of the tree, to a leaf, to retrieve extent
information.
<P>An ALNODE entry only contains two dwords: a running total indicating
the logical sector offset of the last sector in the ALSEC (i.e. how far we
are through the file - this starts from 1); the physical LSN of where to
find the ALSEC. The advantage of the smaller entry size of an ALNODE
compared to an ALLEAF is that, in the same space, there can be more of
them.
<P>An FNODE contains other data. One important piece of information is the
last 15 characters of the filename. This comes in handy when we need to
undelete. The last 316 bytes of the sector is also set aside for internal
ACL/EAs (stored completely within the FNODE). In the Graham Utilities
manual it is stated that up to 316 bytes of EAs can be stored within the
FNODE but my experiments with OS/2 Warp v3 show that only up to 145 bytes
of EAs can be internalised. Refer to Part 6 for further information.
<P>Figure 3 shows the structure of an ALSEC. You will notice that there is
much more space in the sector devoted to ALNODE/ALLEAF entries than is
available in an FNODE sector (480 bytes compared to 96 bytes). This leads
to the following maximum number of entries:
<PRE>
        ALLEAF  ALNODE
FNODE        8      12
ALSEC       40      60
</PRE>
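<P>These maxima follow directly from the entry sizes and the array space in each sector, as this small check (mine, not the article's) shows:

```python
ALLEAF_SIZE = 12          # three dwords per ALLEAF entry
ALNODE_SIZE = 8           # two dwords per ALNODE entry
FNODE_ARRAY_BYTES = 96    # allocation-array space in an FNODE
ALSEC_ARRAY_BYTES = 480   # allocation-array space in an ALSEC

assert FNODE_ARRAY_BYTES // ALLEAF_SIZE == 8
assert FNODE_ARRAY_BYTES // ALNODE_SIZE == 12
assert ALSEC_ARRAY_BYTES // ALLEAF_SIZE == 40
assert ALSEC_ARRAY_BYTES // ALNODE_SIZE == 60
```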
<PRE>
Offset Data Size Comment
hex (dec) bytes
Header
00h (1) Signature 4 0x37E40AAE
04h (5) This block's LSN 4 Helps when placing other blks nearby.
08h (9) Parent's LSN 4 Points to either FNODE or another ALSEC.
0Ch (13) Btree Flag 1 0x20 (bit 5) Parent is an FNODE, else ALSEC.
0x80 (bit 7) ALNODEs follow, else ALLEAFs.
0Dh (14) Padding 3 Reestablish dword alignment.
10h (17) Free Entries 1 Number of free array entries.
11h (18) Used Entries 1 Number of used array entries.
12h (19) Free Ent. Offset 2 Offset to first free entry.
If ALLEAFs (Maximum of 40 in an ALSEC)
Extent #0
14h (21) Logical LSN 4 Sec offset of this extent within file.
Zero-based.
18h (25) Run Size 4 Secs in this extent.
1Ch (29) Physical LSN 4 File: LSN of extent start.
Dir: This B-tree's topmost DIRBLK LSN.
...
Extent #39
1E8h (489) Logical LSN 4
1ECh (493) Run Size 4
1F0h (497) Physical LSN 4
If ALNODEs (Maximum of 60 in an ALSEC)
Extent #0
14h (21) End Sector Count 4 Running total of secs mapped by this
alnode. 1-based. If EOF is within this
alnode then field contains 0xFFFFFFFF.
18h (25) Physical LSN 4 File: LSN of ALSEC.
Dir: This B-tree's topmost DIRBLK LSN.
...
Extent #59
1ECh (493) End Sector Count 4
1F0h (497) Physical LSN 4
Tail
1F4h (501) Padding 12 Unused.
</PRE>
<FONT SIZE=2>
Figure 3: The layout of an ALSEC. This component can contain either an
array of ALNODE or ALLEAF entries.
</FONT>
<H2>Some Examples</H2>
The main program this month, ShowExtents.cmd (to be discussed later),
needs to know the LSN of the FNODE or ALSEC that you want to start with.
It would be possible to design a version that accepted the full pathname
of a file but it would be a larger program. For the purpose of
comprehending these structures, the requirement of having to specify a LSN
is acceptable. To determine the file's FNODE location use last month's
ShowBtree.cmd. Figure 4 shows ShowBtree's output on a 1 MB partition after
"MakeExtents 7" was issued. From the information reported in Figure 4 we
will first examine the TEST directory's FNODE. Figure 5 shows the result
of issuing "ShowExtents N: 1033". Since there is no information in the
allocation array area of a directory FNODE (the 128 byte region commencing
at decimal offset 65), ShowExtents is designed to terminate early in such
a situation.
<PRE>
Root Directory:
1016-1019 Next Byte Free: 125 Topmost DirBlk
This directory's FNODE: 1032 (\ [level 1]) 1016-&gt;1032
**************************************************
SD 21 #00: .. FNODE:1032
D 57 #01: test FNODE:1033
E 93 #02:
36-39 Next Byte Free: 409 Topmost DirBlk
This directory's FNODE: 1033 (test [level 1]) 36-&gt;1033
**************************************************
SD 21 #00: .. FNODE:1033
57 #01: file00002 FNODE:432
97 #02: file00004 FNODE:664
137 #03: file00006 FNODE:896
177 #04: file00008 FNODE:1154
217 #05: file00010 FNODE:1386
257 #06: file00012 FNODE:1618
297 #07: file00014 FNODE:1850
337 #08: fileFRAGG FNODE:316
E 377 #09:
</PRE>
<FONT SIZE=2>
Figure 4: Last month's program, ShowBtree.cmd, shows the LSN of
FileFRAGG's FNODE.
</FONT>
<PRE>
FNODE STRUCTURE
LSN: 1033
Signature: F7E40AAE
Name Length: 4
Name: test
Container Dir LSN: 1032
EA Ext. Run Size: 0
EA LSN: 0
EA Int. Size: 0
EA ALSEC Flag: 0
Dir Flag: Directory FNODE
Topmost DIRBLK LSN: 36
</PRE>
<FONT SIZE=2>
Figure 5: ShowExtents' output when displaying the contents of a directory
FNODE.
</FONT>
<P>Next, we'll look at an FNODE with a full complement of 8 ALLEAF
entries. On my system, this is produced when "MakeExtents 7" is issued.
See Figure 6. The next free entry in the array of ALLEAF entries is at
offset 104 dec. Since the start point for this offset is counted from 65
dec, this means that the next entry would start at 169 dec. This is
actually past the end of the available entry area, at the beginning of the
tail region. This is another indication that the array is full. (The main
indication is the "0" value in the Free Entries field.)
<PRE>
FNODE STRUCTURE
LSN: 316
Signature: F7E40AAE
Name Length: 9
Name: fileFRAGG
Container Dir LSN: 1033
EA Ext. Run Size: 0
EA LSN: 0
EA Int. Size: 0
EA ALSEC Flag: 0
Dir Flag: File FNODE
B+tree Info Flag: ALLEAFs follow
Free Entries: 0
Used Entries: 8
Next Free Offset: 104
Valid data size: 420352
"Needed" EAs: 0
EA/ACL Int. Off: 0
ALLEAF INFORMATION
Extent #0: 115 sectors starting at LSN 317 (file sec offset:0)
Extent #1: 116 sectors starting at LSN 548 (file sec off:115)
Extent #2: 116 sectors starting at LSN 780 (file sec off:231)
Extent #3: 116 sectors starting at LSN 1038 (file sec off:347)
Extent #4: 116 sectors starting at LSN 1270 (file sec off:463)
Extent #5: 116 sectors starting at LSN 1502 (file sec off:579)
Extent #6: 116 sectors starting at LSN 1734 (file sec off:695)
Extent #7: 10 sectors starting at LSN 1966 (file sec off:811)
</PRE>
<FONT SIZE=2>
Figure 6: An FNODE with a full ALLEAF array.
</FONT>
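<P>The "array is full" arithmetic above can be checked directly (my own sketch, using the values reported in Figure 6):

```python
ARRAY_START = 65          # 1-based offset where the FNODE array begins
ARRAY_BYTES = 8 * 12      # 8 ALLEAF entries of 12 bytes each

next_free = 104           # "Next Free Offset" reported in Figure 6
next_entry_pos = ARRAY_START + next_free   # where the next entry would go
array_end = ARRAY_START + ARRAY_BYTES      # first byte past the array

assert next_entry_pos == 169
assert next_entry_pos >= array_end         # 169 >= 161: the array is full
```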
<P>If we need to map any more extents we must switch from a FNODE (with
ALLEAFs) structure to FNODE (with ALNODEs) -&gt; ALSEC (with ALLEAFs). Figure
7 shows the mapping of a 10-extent file ("MakeExtents 8"). The B+tree Info
Flag tells us that the FNODE contains an array of ALNODEs. There is only
one entry in this array. The End Sector Count value is not shown here but,
in this example, you could easily check it out using Part 2's SEC.cmd
("SEC N: 316") and then look at the four bytes at offset 40h (in the case
of a single entry in the array). Since this is the sole entry, you will
find FFFFFFFFh (appears to be the array End-of-Entries indicator) at this
location.
<PRE>
FNODE STRUCTURE
LSN: 316
Signature: F7E40AAE
Name Length: 9
Name: fileFRAGG
Container Dir LSN: 1033
EA Ext. Run Size: 0
EA LSN: 0
EA Int. Size: 0
EA ALSEC Flag: 0
Dir Flag: File FNODE
B+tree Info Flag: ALNODEs follow
Free Entries: 11
Used Entries: 1
Next Free Offset: 16
Valid data size: 418304
"Needed" EAs: 0
EA/ACL Int. Off: 0
FNODE Entry #0
ALSEC STRUCTURE
Signature: 37E40AAE
This LSN: 933
Parent's LSN: 316
B+tree Info Flag: Parent was an FNODE; ALLEAFs follow
Free Entries: 30
Used Entries: 10
Next Free Offset: 128
ALLEAF INFORMATION
Extent #0: 101 sectors starting at LSN 317 (file sec off:0)
Extent #1: 102 sectors starting at LSN 520 (file sec off:101)
Extent #2: 102 sectors starting at LSN 724 (file sec off:203)
Extent #3: 102 sectors starting at LSN 1158 (file sec off:305)
Extent #4: 102 sectors starting at LSN 1362 (file sec off:407)
Extent #5: 102 sectors starting at LSN 1566 (file sec off:509)
Extent #6: 102 sectors starting at LSN 1770 (file sec off:611)
Extent #7: 42 sectors starting at LSN 1974 (file sec off:713)
Extent #8: 5 sectors starting at LSN 928 (file sec off:755)
Extent #9: 57 sectors starting at LSN 934 (file sec off:760)
</PRE>
<FONT SIZE=2>
Figure 7: A 10-extent file is mapped in a 1-level B+tree with a single
ALSEC.
</FONT>
<P>The next section in the display in Figure 7, labelled "FNODE Entry #0"
shows us that the sole ALNODE entry points to LSN 933. Here we are seeing
this ALSEC's layout. The B+tree Info Flag informs us that this ALSEC
contains ALLEAF entries i.e. the actual mapping of the extents. Notice
that we have 10 ALLEAF entries in the allocation array. Remember that an
ALSEC has much more space available for array entries than an FNODE has,
in that it can store up to 40 ALLEAF entries. You can verify this by
adding the ALSEC's Free Entries and the Used Entries values together.
<P>When you try to map more than 40 extents you will exceed the capacity
of a sole ALSEC. What happens in this case is that more ALNODE entries are
created in the FNODE, each pointing to an ALSEC. Figure 8 shows a
42-extent layout (produced with a parameter of "45").
<PRE>
FNODE STRUCTURE
LSN: 316
Signature: F7E40AAE
Name Length: 9
Name: fileFRAGG
Container Dir LSN: 1033
EA Ext. Run Size: 0
EA LSN: 0
EA Int. Size: 0
EA ALSEC Flag: 0
Dir Flag: File FNODE
B+tree Info Flag: ALNODEs follow
Free Entries: 10
Used Entries: 2
Next Free Offset: 24
Valid data size: 393192
"Needed" EAs: 0
EA/ACL Int. Off: 0
FNODE Entry #0
ALSEC STRUCTURE
Signature: 37E40AAE
This LSN: 588
Parent's LSN: 316
B+tree Info Flag: Parent was an FNODE; ALLEAFs follow
Free Entries: 0
Used Entries: 40
Next Free Offset: 232
ALLEAF INFORMATION
Extent #0: 16 sectors starting at LSN 317 (file sec off:0)
...
Extent #39: 17 sectors starting at LSN 1668 (file sec off:720)
FNODE Entry #1
ALSEC STRUCTURE
Signature: 37E40AAE
This LSN: 996
Parent's LSN: 316
B+tree Info Flag: Parent was an FNODE; ALLEAFs follow
Free Entries: 38
Used Entries: 2
Next Free Offset: 32
ALLEAF INFORMATION
Extent #40: 17 sectors starting at LSN 1702 (file sec off:737)
Extent #41: 14 sectors starting at LSN 1736 (file sec off:754)
</PRE>
<FONT SIZE=2>
Figure 8: 42 extents require a 1-level B+tree with 2 ALNODE entries in the
FNODE pointing to 2 ALSECs.
</FONT>
<P>There is space in an FNODE for 12 ALNODE entries. If each of these
points to a full ALSEC (with ALLEAFs), i.e. 40 entries each, this two-level
structure can accommodate 480 extents (parameter "564").
<P>Let's see what happens when we exceed this value. Figure 9 shows a
482-extent layout ("565"). Interesting things have occurred. We now have a
2-level B+tree structure. The FNODE ALNODE array has been adjusted to
contain a sole entry. This in turn points to an ALSEC that has 13 ALNODE
entries. Each of these ALNODE points to another ALSEC which contains
ALLEAF entries. 12 of the ALSECs (with ALLEAFs) are full i.e. 12*40 while
the 13th ALSEC (with ALLEAFs) only maps 2 extents.
<PRE>
FNODE STRUCTURE
LSN: 1000
Signature: F7E40AAE
Name Length: 9
Name: fileFRAGG
Container Dir LSN: 1033
EA Ext. Run Size: 0
EA LSN: 0
EA Int. Size: 0
EA ALSEC Flag: 0
Dir Flag: File FNODE
B+tree Info Flag: ALNODEs follow
Free Entries: 11
Used Entries: 1
Next Free Offset: 16
Valid data size: 524264
"Needed" EAs: 0
EA/ACL Int. Off: 0
FNODE Entry #0
ALSEC STRUCTURE
Signature: 37E40AAE
This LSN: 1333
Parent's LSN: 1000
B+tree Info Flag: Parent was an FNODE; ALNODEs follow
Free Entries: 47
Used Entries: 13
Next Free Offset: 112
ALNODE INFORMATION
ALSEC Entry #0 situated at LSN 328 (file sec count:582)
ALSEC STRUCTURE
Signature: 37E40AAE
This LSN: 328
Parent's LSN: 1333
B+tree Info Flag: ALLEAFs follow
Free Entries: 0
Used Entries: 40
Next Free Offset: 232 ALLEAF INFORMATION Extent #0-#39
ALNODE INFORMATION
ALSEC Entry #1 situated at LSN 394 (file sec count:622)
ALSEC STRUCTURE 394 (40) ALLEAF INFORMATION Extent #40-#79
ALNODE INFORMATION
ALSEC Entry #2 situated at LSN 476 (file sec count:662)
ALSEC STRUCTURE 476 (40) ALLEAF INFORMATION Extent #80-#119
ALNODE INFORMATION
ALSEC Entry #3 situated at LSN 558 (file sec count:702)
ALSEC STRUCTURE 558 (40) ALLEAF INFORMATION Extent #120-#159
ALNODE INFORMATION
ALSEC Entry #4 situated at LSN 640 (file sec count:742)
ALSEC STRUCTURE 640 (40) ALLEAF INFORMATION Extent #160-#199
ALNODE INFORMATION
ALSEC Entry #5 situated at LSN 722 (file sec count:782)
ALSEC STRUCTURE 722 (40) ALLEAF INFORMATION Extent #200-#239
ALNODE INFORMATION
ALSEC Entry #6 situated at LSN 804 (file sec count:822)
ALSEC STRUCTURE 804 (40) ALLEAF INFORMATION Extent #240-#279
ALNODE INFORMATION
ALSEC Entry #7 situated at LSN 886 (file sec count:862)
ALSEC STRUCTURE 886 (40) ALLEAF INFORMATION Extent #280-#319
ALNODE INFORMATION
ALSEC Entry #8 situated at LSN 968 (file sec count:902)
ALSEC STRUCTURE 968 (40) ALLEAF INFORMATION Extent #320-#359
ALNODE INFORMATION
ALSEC Entry #9 situated at LSN 1085 (file sec count:942)
ALSEC STRUCTURE 1085 (40) ALLEAF INFORMATION Extent #360-#399
ALNODE INFORMATION
ALSEC Entry #10 situated at LSN 1167 (file sec count:982)
ALSEC STRUCTURE 1167 (40) ALLEAF INFORMATION Extent #400-#439
ALNODE INFORMATION
ALSEC Entry #11 situated at LSN 1249 (file sec count:1022)
ALSEC STRUCTURE 1249 (40) ALLEAF INFORMATION Extent #440-#479
ALNODE INFORMATION
ALSEC Entry #12 situated at LSN 1331 (file sec count:At end)
ALSEC STRUCTURE 1331 (2) ALLEAF INFORMATION Extent #480-#481
</PRE>
<FONT SIZE=2>
Figure 9: 482 extents are mapped by a 2-level B+tree with 1 ALNODE entry
in the FNODE pointing to 1 ALSEC, which in turn points to 13 ALSECs.
</FONT>
<P>If you look at FNODE Entry #0's Used &amp; Free Entries values you can
verify that, in an ALSEC, there can be a maximum of 60 ALNODEs. It would
take 60*40 = 2,400 extents to fill this level up again. Going past this
would require the presence of a second FNODE entry. Since we can have up
to 12 ALNODE entries in an FNODE, this means we could map 12*60*40 =
28,800 extents before the need to insert another intermediary ALSEC level
would arise.
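<P>The capacity figures quoted in the last two paragraphs multiply out as follows (a simple check, not from the article):

```python
FNODE_ALNODES = 12   # ALNODE entries per FNODE
ALSEC_ALNODES = 60   # ALNODE entries per ALSEC
ALSEC_ALLEAFS = 40   # ALLEAF entries per ALSEC

assert FNODE_ALNODES * ALSEC_ALLEAFS == 480                    # 1-level tree
assert ALSEC_ALNODES * ALSEC_ALLEAFS == 2400                   # one mid ALSEC
assert FNODE_ALNODES * ALSEC_ALNODES * ALSEC_ALLEAFS == 28800  # 2-level tree
```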
<P>On a 100 MB partition I produced a 3-level 44,413 extent structure
("44500"). To put this discussion on B+tree fan-out in perspective, it
should be remembered that, in the fragmentation analysis performed in Part
3 on 20,800 files in 5 partitions, there were only 14 files with more than
8 extents (i.e. requiring an ALSEC) and the largest number of extents
reported was 30.
<H2>The ShowExtents Program</H2>
Figure 10 presents the ShowExtents.cmd REXX program. You will need to get
SECTOR.DLL. The program first determines if the LSN you've specified
belongs to an FNODE or ALSEC. (You can bypass the FNODE and commence the
examination from an ALSEC.) Once it has determined this, the next most
important consideration is: does the allocation array consist of ALLEAFs
or ALNODEs? If it contains ALLEAFs we've reached the end of the tree and
need only show the extents. If we are looking at an array of ALNODEs we
need to recurse down each ALNODE entry, loading the ALSEC pointed to by
the entry and then see whether it contains either ALLEAFs or ALNODEs. And
so on...
<PRE>
/*Shows the layout of FNODE and ALSECs. Requires SECTOR.DLL*/
PARSE UPPER ARG drive lsn
/* There must be at least two parms supplied */
IF drive = '' | lsn = '' THEN CALL HELP
/* Register external functions */
CALL RxFuncAdd 'QDrive','sector','QDrive'
CALL RxFuncAdd 'ReadSect','sector','ReadSect'
alleafEntryCount = 0
anodeEntryCount = 0
SAY
CALL MainRoutine
EXIT /*****************EXECUTION ENDS HERE*****************/
MainRoutine:
PROCEDURE EXPOSE drive lsn alleafEntryCount anodeEntryCount
usedEntries = 0
sectorString = ReadSect(drive,lsn) /* Read in required sec */
IF FourBytes2Hex(1) = 'F7E40AAE' THEN
/* Is an FNODE */
DO
alSecIndicator = ''
CALL DisplayFnode
END
ELSE
/* Not an FNODE */
DO
IF FourBytes2Hex(1) = '37E40AAE' THEN
/* Is an ALSEC */
DO
alSecIndicator = 'Y'
CALL DisplayALSEC
END
ELSE
/* Neither an FNODE or an ALSEC */
DO
SAY 'LSN' lsn 'is not an FNODE or ALSEC'
EXIT
END
END
RETURN
DisplayFnode:
SAY 'FNODE STRUCTURE'
SAY 'LSN: ' lsn
SAY 'Signature: ' FourBytes2Hex(1)
SAY 'Name Length: ' Bytes2Dec(13,1)
SAY 'Name: ' Substr(sectorString,14,Bytes2Dec(13,1))
SAY 'Container Dir LSN:' Bytes2Dec(29,4)
SAY 'EA Ext. Run Size: ' Bytes2Dec(45,4)
SAY 'EA LSN: ' Bytes2Dec(49,4)
SAY 'EA Int. Size: ' Bytes2Dec(53,2)
SAY 'EA ALSEC Flag: ' Bytes2Dec(55,1)
IF Bitand(Byte2Char(56),'1'x) = '1'x THEN
dirFlag = 'Directory FNODE'
ELSE
dirFlag = 'File FNODE'
SAY 'Dir Flag: ' dirFlag
IF dirFlag = 'Directory FNODE' THEN
SAY 'Topmost DIRBLK LSN:'||Bytes2Dec(73,4)
ELSE
DO
/* Is a file, so determine extents */
CALL DetermineBtreeInfo 57
SAY 'B+tree Info Flag: ' btreeInfo
SAY 'Free Entries: ' Bytes2Dec(61,1)
usedEntries = Bytes2Dec(62,1)
SAY 'Used Entries: ' usedEntries
SAY 'Next Free Offset: ' Bytes2Dec(63,2)
SAY 'Valid data size: ' Bytes2Dec(161,4)
SAY '"Needed" EAs: ' Bytes2Dec(165,4)
SAY 'EA/ACL Int. Off: ' Bytes2Dec(169,4)
CALL ShowALLEAF_or_ANODE
END
RETURN
FourBytes2Hex: /* Given offset, return Dword */
ARG startPos
rearranged = Reverse(Substr(sectorString,startPos,4))
RETURN C2X(rearranged)
Bytes2Dec:
ARG startPos,numOfChars
temp = Substr(sectorString,startPos,numOfChars)
IF C2X(temp) = 'FFFFFFFF' THEN
RETURN 'At the end'
ELSE
RETURN Format(C2D(Reverse(temp)),,0)
Byte2Char:
ARG startPos
RETURN Substr(sectorString,startPos,1)
DetermineBtreeInfo:
ARG btreeByteOffset
IF Bitand(Byte2Char(btreeByteOffset),'20'x) = '20'x THEN
btreeInfo = 'Parent was an FNODE; '
ELSE
btreeInfo = ''
IF Bitand(Byte2Char(btreeByteOffset),'80'x) = '80'x THEN
DO
btreeInfo = btreeInfo||'ALNODEs follow'
alNodeIndicator = 'Y'
END
ELSE
DO
btreeInfo = btreeInfo||'ALLEAFs follow'
alNodeIndicator = 'N'
END
RETURN
DisplayALSEC:
SAY 'ALSEC STRUCTURE'
alSecIndicator = 'Y'
SAY 'Signature: ' FourBytes2Hex(1)
lsn = Bytes2Dec(5,4)
SAY 'This LSN: ' lsn
SAY "Parent's LSN: " Bytes2Dec(9,4)
CALL DetermineBtreeInfo 13
SAY 'B+tree Info Flag: ' btreeInfo
SAY 'Free Entries: ' Bytes2Dec(17,1)
usedEntries = Bytes2Dec(18,1)
SAY 'Used Entries: ' usedEntries
SAY 'Next Free Offset: ' Bytes2Dec(19,2)
CALL ShowALLEAF_or_ANODE
RETURN
ShowALLEAF_or_ANODE: PROCEDURE EXPOSE drive lsn sectorString,
usedEntries alleafEntryCount anodeEntryCount entrySize,
alsecIndicator alnodeIndicator
IF alsecIndicator = 'Y' THEN
entryOffset = 21
ELSE
entryOffset = 65
IF alnodeIndicator \= 'Y' THEN
/* Is an ALLEAF */
DO
SAY
IF usedEntries = 0 THEN
DO
SAY 'Zero-length file'
EXIT
END
SAY 'ALLEAF INFORMATION'
entrySize = 12
DO entry = alleafEntryCount TO alleafEntryCount+usedEntries-1
fileSecOffset = Bytes2Dec(entryOffset,4)
runSize = Bytes2Dec(entryOffset+4,4)
physicalLSN = Bytes2Dec(entryOffset+8,4)
SAY 'Extent #'||entry||':' runSize 'sectors starting at LSN',
physicalLSN '(file sec offset:'||fileSecOffset||')'
entryOffset = entryOffset+entrySize
END entry
alleafEntryCount = entry
END
ELSE
DO
/* Is either an ALNODE in an ALSEC or in an FNODE */
entrySize = 8
IF alSecIndicator \= 'Y' THEN
/* In an FNODE */
DO entry = anodeEntryCount TO anodeEntryCount+usedEntries-1
lsn = Bytes2Dec(entryOffset+4,4)
SAY
SAY 'FNODE Entry #' || entry
CALL MainRoutine
entryOffset = entryOffset+entrySize
END entry
ELSE
DO
/* In an ALSEC */
listStart = 65
sectorString = ReadSect(drive,lsn)
DO entry = anodeEntryCount TO anodeEntryCount+usedEntries-1
SAY
SAY 'ALNODE INFORMATION'
fileSecOffset = Bytes2Dec(entryOffset,4)
lsn = Bytes2Dec(entryOffset+4,4)
SAY 'ALSEC Entry #'||entry 'situated at LSN',
lsn '(file sec count:'||fileSecOffset||')'
CALL MainRoutine
anodeEntryCount = entry
entryOffset = entryOffset+entrySize
END entry
END
END
RETURN
Help:
SAY 'ShowExtents shows the extents mapped by a FNODE or ALSEC'
SAY 'structure.'
SAY
SAY ' Usage: ShowExtents drive LSN_of_a_FNODE/ALSEC'
SAY ' Example: ShowExtents C: 316'
EXIT
</PRE>
<FONT SIZE=2>
Figure 10: The ShowExtents.cmd program.
</FONT>
<H2>Counting Extents</H2>
It is handy to be able to report just the number of extents in a file.
HPFS-EXT, in the Graham Utilities, can do this. It takes a filename. It is
available in the demo version of the Graham Utilities, "GULITE.xxx".
<P>The freeware FST (currently FST03F.xxx) does just about everything. You
can specify either a filename ("FST INFO N: \TEST\FILEFRAGG" - note the
space after the drive letter) or a LSN ("FST INFO N: 1000"). It will
include the height of the B+tree and the total number of extents at the
end of its display. Unfortunately, it displays a lot of other info, and
sometimes you're only interested in the number of levels and
extents.
<P>I cut down ShowExtents.cmd to produce CountExtents.cmd. The design was
not amenable to showing the height but it was a straightforward matter to
show just the number of extents. I've not bothered to present it here
since most readers will probably prefer to specify the filename. (The
FNODE LSN keeps changing as you increase the number of extents, so this
makes it more difficult to use CountExtents.)
<H2>Conclusion</H2>
In this installment we have seen how a file's sectors are mapped by FNODEs
and ALSECs. These file system components can contain either an array of
ALNODE or ALLEAF entries. By following through to the ALLEAFs we can
examine the mapping of extents.
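<P>As an illustrative aside, the fixed-size entries that ShowExtents.cmd walks can be decoded very compactly in other languages. The Python sketch below assumes the 12-byte ALLEAF and 8-byte ALNODE layouts implied by the Bytes2Dec calls above (three and two little-endian dwords respectively); the field names simply mirror the REXX variables and are not HPFS's own.

```python
import struct

def decode_alleaf(buf, off):
    # One 12-byte ALLEAF entry: logical file-sector offset,
    # run length in sectors, starting physical LSN (little-endian).
    file_sec, run_len, lsn = struct.unpack_from("<III", buf, off)
    return {"fileSecOffset": file_sec, "runSize": run_len, "physicalLSN": lsn}

def decode_alnode(buf, off):
    # One 8-byte ALNODE entry: highest file-sector offset mapped
    # below this entry, and the LSN of the child ALSEC.
    file_sec, child_lsn = struct.unpack_from("<II", buf, off)
    return {"fileSecOffset": file_sec, "childLSN": child_lsn}
```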
<P>We have also seen how a B+tree is different from a B-tree. In a DIRBLK
B-tree, DIRENT information can be found in a node entry. But in an ALSEC
B+tree, extent information is not stored in node entries, only in the
leaves. The filling of nodes in an ALSEC B+tree is also much more
efficient than the utilisation of nodal space in a DIRENT's B-tree.
<P>When the next installment is presented we'll look at Extended
Attributes. While not specifically a HPFS topic, they are well integrated
into the file system and will fit well into this series.

<html><head><title>Operating Systems: The HPFS Filesystem</title></head>
<body BGCOLOR=#FFFFFF TEXT=#000000 LINK="#0000FF" VLINK="#0000FF" ALINK="#107010">
<center><font face=Verdana size=7><b>HPFS FileSystem</b></font></center>
<hr><p>
This series of articles apparently originally appeared in the now-defunct OS2Zone (their page should be at http://www.os2zone.aus.net), written by Dan Bridges. I ran across it during my journeys of the net, and put it up here... The "original" form is <a href="hpfs.zip">available here</a>. This is a six-part series of articles on HPFS.<p>
<ul><DL>
<DT><font size=+1><a href="hpfs0.html">Part #0 - Preface</a></font><br>
<DD>This article is the initial "preface" article that explains the motivations behind the series.
It also talks about the filesystem organization scheme used by the FAT filesystem... and briefly
introduces HPFS.<p>
<DT><font size=+1><a href="hpfs1.html">Part #1 - Introduction</a></font><br>
<DD>This introductory article compares the FAT filesystem against the HPFS filesystem in terms that
a user would understand. This talks about the practical differences, such as speed, size, and
fragmentation.<p>
<DT><font size=+1><a href="hpfs2.html">Part #2 - The SuperBlock and the SpareBlock</a></font><br>
<DD>This article starts delving more deeply into HPFS' internal structures. Two REXX programs are
presented that greatly assist in the search for information. It also briefly looks at some
other HPFS-related programs. Finally, you will see the Big Picture when the major structures
of a HPFS partition are shown. <p>
<DT><font size=+1><a href="hpfs3.html">Part #3 - Fragmentation, Diskspace Bitmaps and Code Pages</a></font><br>
<DD>This article looks at how HPFS knows which sectors are occupied and which ones are free.
It examines the amount of file fragmentation on five HPFS volumes and also checks out the
fragmentation of free space. A program is presented to show free runs and some other
details. Finally, it briefly discusses Code Pages and looks at a program that displays
their contents.<p>
<DT><font size=+1><a href="hpfs4.html">Part #4 - B-Trees, DIRBLKs, and DIRENTs</a></font><br>
<DD>The most basic structures in the HPFS are DIRBLKs, DIRENTs and FNODEs. This article examines
DIRBLKs and DIRENTs, talks about the differences between binary trees and B-trees and shows
how DIRBLKs are interconnected to facilitate quick access in a large directory (one of HPFS'
strengths). To assist in this investigation, a program, ShowBtree.cmd, helps to visualise
the layout of directory and file entries in a partition.<p>
<DT><font size=+1><a href="hpfs5.html">Part #5 - FNODEs, ALSECs and B+trees</a></font><br>
<DD>This article takes a long look at how a file's contents are logically stored under HPFS.
It is helpful to contrast the following file-sector allocation methods with the last article's
directory entry concepts. It also talks about fragmentation and how HPFS deals with it.<p>
<DT><font size=+1>Part #6 - ?</font><br>
<DD>This is as far as I can go... if anyone has any of the other articles that appeared in this
series, please please send them my way...<p>
</DL></ul>
<p><hr><FONT SIZE = 4><TABLE ALIGN=RIGHT BORDER=0><TR><TD><center>
Copyright &copy; 1998 <i><a href="mailto:sabre@nondot.org">Chris Lattner</a></i><br>
Last modified: Wednesday, 13-Sep-2000 14:10:50 CDT </center></TD></TR></TABLE>

<html>
<head>
<title>Joliet Specification</title>
</head>
<body bgcolor="#ffffff">
<a name="top"></a><center>
<h1>Joliet Specification</h1>
<b>
<p>CD-ROM Recording Spec ISO 9660:1988</b> </center> <br>
</p>
<p>Extensions for Unicode Version 1; May 22, 1995 </p>
<p>Copyright 1995, Microsoft Corporation All Rights Reserved <br>
Contact Microsoft Developer Relations Group <br>
MAC@avca.com </p>
<hr>
<h2><a name="contents">CONTENTS</a></h2>
<ul>
<li><a href="#preface">Preface</a>
<ul>
<li><a href="#scope">Purpose and Scope</a> </li>
<li><a href="#overview">Overview </a> </li>
<li><a href="#terms">Terminology and Notation</a> </li>
</ul>
</li>
<li><a href="#recording">Joliet Recording Specification</a>
<ul>
<li><a href="#change">Change Summary </a> </li>
</ul>
</li>
<li><a href="#unicode">Identifying an ISO 9660 SVD as Unicode
(UCS-2)</a>
<ul>
<li><a href="#escapes">SVD Escape Sequences Field</a> </li>
<li><a href="#flags">SVD Volume Flags Field</a> </li>
<li><a href="#resolution">Resolution of ISO 9660
Ambiguities for Wide Characters </a> </li>
<li><a href="#wide">Wide Character Byte Ordering</a> </li>
<li><a href="#allowed">Allowed Character Set </a> </li>
<li><a href="#identifiers">Special Directory
Identifiers</a> </li>
<li><a href="#separator">Separator Characters</a> </li>
<li><a href="#sort">Sort Ordering</a> </li>
<li><a href="#relaxation">Relaxation of ISO 9660
Restrictions on UCS-2 Volumes </a> </li>
</ul>
</li>
<li><a href="#extension">Extensions to Joliet</a>
<ul>
<li><a href="#multisession">Joliet for Multisession
Media</a> </li>
<li><a href="#cdxa">CD-XA Extensions to Joliet</a> </li>
<li><a href="#other">Other Extensions to Joliet </a> </li>
</ul>
</li>
<li><a href="#bibliography">Bibliography</a> </li>
</ul>
<h2><a name="preface">Preface</a></h2>
<h3><a name="scope"></a>Purpose and Scope </h3>
<p>While the CD-ROM media provides for cost-effective software
distribution, the existing ISO 9660 file system contains a number
of restrictions which interfere with simple and efficient
distribution of files on a CD-ROM. </p>
<p>The read-only nature of CD-ROM media has led content authors
to continue to use traditional magnetic media as their main
avenue for creating applications. Each of the existing file
systems for magnetic media contain various features which can not
be represented on CD-ROM media using an unenhanced version of ISO
9660. </p>
<p>As content authors attempt to transfer their applications to
the CD-ROM, they are likely to find that some of their work
cannot be distributed on the CD-ROM media due to restrictions in
the ISO 9660 format. This frustrates some content authors. </p>
<p>Because the CD-ROM media is mainly a distribution media,
rather than a creative (read/write) media, it is necessary for
the CD-ROM file system to support a superset of the creative
media features. This fundamental flaw in the design of ISO 9660
has prompted several operating systems vendors to extend ISO 9660
in several ways. Some examples are Rock Ridge Interchange
Protocol and Apple's use of the System Use Area to store finder
flags. </p>
<p>Some of the ISO 9660 problems which are addressed by this
specification include: </p>
<ul>
<li>Character Set limitations. </li>
<li>File Name Length limitations </li>
<li>Directory Tree Depth limitations </li>
<li>Directory Name Format limitations </li>
<li>Wide Character (16-bit character) ambiguities </li>
</ul>
<p>The general design approach used in the Joliet specification
is to relax restrictions and resolve ambiguities in the ISO
9660:1988 specification so the practical goals can be met. </p>
<h3><a name="overview"></a>Overview </h3>
<p>The Joliet specification utilizes the supplementary volume
descriptor (SVD) feature of ISO 9660 to specify a set of files
recorded within the Unicode character set. </p>
<p>The ISO 10646 character set specification may be identified by
an ISO 2022 escape sequence. By recording this escape sequence in
an ISO 9660 SVD, this technique for identifying the Unicode SVD
is compliant with the ISO 9660 specification. It also retains
interchange by not disrupting the files referenced through the
primary volume descriptor (PVD). </p>
<p>All that remains is to resolve minor technical ambiguities
within ISO 9660 which arise as the result of the use of wide
characters. </p>
<p>Because the use of this particular escape sequence in an ISO
9660 SVD is unprecedented up to this time, several of the
restrictions which are imposed by ISO 9660 may be relaxed without
significantly disrupting information interchange between existing
systems from a practical standpoint. </p>
<p>This design approach has several benefits. For instance, the
use of the existing ISO 9660 standard allows for straightforward
integration with existing extensions to ISO 9660. The designs for
the System Use Sharing Protocol, Rock Ridge extensions for POSIX
semantics, CD-XA System Use Area Semantics, Apple's Finder Flags
and Resource Forks, all port in a straightforward manner to the
Joliet specification. </p>
<p>Also, the use of a new SVD eliminates the danger of breaking
software compatibility with existing ISO 9660 systems. Existing
software will simply ignore the Unicode SVD, and will simply use
the PVD instead. This compatibility &quot;safety-valve&quot;
makes the goal of relaxing the file system's restrictions easier. </p>
<p>This document describes how a CD-ROM may be constructed so
that names on the volume can be recorded in Unicode while
remaining in compliance with ISO 9660. The particular ISO 10646
character sets used here are UCS-2 Level 1, UCS-2 Level 2, and
UCS-2 Level 3. </p>
<p>The basic strategy of CD-ROM volume recognition is the Volume
Recognition Sequence, which is a sequence of volume descriptors,
recorded one per sector, starting at Sector 16 in the first track
of the last session on the disc. A receiving system reads these
sectors and chooses a particular volume descriptor from the
sequence. This volume descriptor acts as a kind of anchor upon
which the remainder of the volume is constructed. </p>
<h3><a name="terms">Terminology and Notation</a></h3>
<p>Joliet is based on the ISO 9660:1988 standard. Unless defined
in this document, the terminology used shall be as defined in ISO
9660:1988. </p>
<p>The following notation is used in this document. </p>
<ul>
<li>Decimal and Hexadecimal Notation
<ul>
<li>Numbers in decimal notation are represented by
decimal digits, namely 0 to 9. </li>
<li>Numbers in hexadecimal notation are represented
by hexadecimal digits, namely 0 to 9 and A to F,
shown in parentheses. For instance, the
hexadecimal number D0 shall be written as (D0). </li>
</ul>
</li>
<li>A literal sequence of ASCII characters will be
represented by those characters within single quotes. For
instance, 'ABC' means the byte sequence (41)(42)(43). </li>
<li>References to characters in the ISO 2022 escape sequence
will be given in comma-separated decimal nibble/nibble
format, in hexadecimal format, and as ASCII characters,
with equal signs between each format, all enclosed within
parenthesis. For instance, the 3-byte ISO 2022 escape
sequence for Shift-JIS is (2/4, 2/11, 3/10 =
(24)(2B)(3A)= '$+:'). </li>
</ul>
<p><a name="recording"></a><a href="#contents">return to the
table of contents</a> </p>
<h2>Joliet Recording Specification</h2>
<h3><a name="change"></a>Change Summary</h3>
<p>The Joliet specification resolves the following ISO 9660
ambiguities for UCS-2 volumes: </p>
<ul>
<li>Use a SVD with a UCS-2 (UNICODE) Escape Sequence. </li>
<li>The UCS-2 escape sequences used are: (25)(2F)(40),
(25)(2F)(43), or (25)(2F)(45). </li>
<li>The default setting of bit 0 of the SVD &quot;Volume
Flags Field&quot; is ZERO. </li>
<li>The Unicode Wide characters shall be recorded in
&quot;Big Endian&quot; (Motorola) format. </li>
<li>Special Directory Identifiers are recorded as single byte
names containing (00) or (01). </li>
<li>SEPARATOR 1 and SEPARATOR 2 are encoded using an
equivalent 16-bit code point. </li>
<li>Sort ordering is unchanged, except that all justification
pad bytes are to be set to (00). </li>
</ul>
<p>The Joliet specification recommends that several ISO 9660
restrictions be lifted on UCS-2 volumes. The Joliet specification
allows for the following interchange rules: </p>
<ul>
<li>The File or Directory Identifiers may be up to 128 bytes
(64 unicode characters) in length. </li>
<li>Directory Identifiers may contain file name extensions. </li>
<li>The Directory Hierarchy may be recorded deeper than 8
levels. </li>
<li>The volume recognition sequence supports multisession.
This is compatible with the CD-Bridge specification. </li>
</ul>
<p>The Joliet specification may be extended through the use of
the following specifications: </p>
<ul>
<li>Mode 2 Form 2 extents and CD-DA extents, (&quot;System
Description CD-ROM XA&quot;) </li>
<li>System Use Sharing Protocol (not explicitly specified
here) </li>
<li>RockRidge Interchange Protocol (not explicitly specified
here) </li>
<li>Other future CD-ROM file system formats </li>
</ul>
<p> <a name="unicode"></a> </p>
<p><a href="#contents">return to the table of contents</a> </p>
<h2>Identifying an ISO 9660 SVD as Unicode (UCS-2)</h2>
<h3><a name="escapes">SVD Escape Sequences Field</a></h3>
<p>The Escape Sequences field of an ISO 9660 Supplementary Volume
Descriptor (ISO 9660 section 8.5.6) shall identify the character
set used to interpret descriptor fields related to the Directory
Hierarchy identified by the Volume Descriptor. </p>
<p>If the Escape Sequences field of an ISO 9660 SVD identifies
any of the following UCS-2 escape sequences, then the descriptor
fields related to the Directory Hierarchy identified by that
Volume Descriptor shall be interpreted according to the
identified UCS-2 character set. </p>
<p> </p>
<hr>
<b>
<p>Table 1 - ISO 2022 UCS-2 Escape Sequences</b> </p>
<pre>
  ISO 2022 Escape Sequence as recorded in the ISO 9660 SVD

  Standard Level    Decimal          Hex Bytes       ASCII
  UCS-2 Level 1     2/5, 2/15, 4/0   (25)(2F)(40)    '%/@'
  UCS-2 Level 2     2/5, 2/15, 4/3   (25)(2F)(43)    '%/C'
  UCS-2 Level 3     2/5, 2/15, 4/5   (25)(2F)(45)    '%/E'
</pre>
<hr>
<p>A &quot;Unicode Volume&quot; refers to the Volume Descriptor
and Directory Hierarchy identified by a Supplementary Volume
Descriptor containing an Escape Sequences field which identifies
any of the above UCS-2 character sets. </p>
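<p>A receiving system can thus recognise a Unicode Volume by scanning the Volume Recognition Sequence for an SVD whose Escape Sequences field carries one of the sequences in Table 1. A minimal illustrative sketch in Python, assuming the Escape Sequences field occupies bytes 89-120 (zero-based offset 88) of the 2048-byte descriptor, per ISO 9660 section 8.5.6: </p>

```python
# UCS-2 escape sequences from Table 1, mapped to their level.
JOLIET_ESCAPES = {b"%/@": 1, b"%/C": 2, b"%/E": 3}

def joliet_level(svd):
    """Return the UCS-2 level (1-3) if this 2048-byte volume descriptor
    is a Joliet SVD, else None. Checking only the leading escape
    sequence is a simplification: more than one may be recorded."""
    if svd[0] != 2 or svd[1:6] != b"CD001":   # SVD type and standard id
        return None
    esc = svd[88:120]                         # Escape Sequences field
    for seq, level in JOLIET_ESCAPES.items():
        if esc.startswith(seq):
            return level
    return None
```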
<h3><a name="flags">SVD Volume Flags Field</a></h3>
<p>The UCS-2 Level 1, UCS-2 Level 2, and UCS-2 Level 3 escape
sequences are considered to be registered according to ISO 2375 for
purposes of setting bit 0 of the Volume Flags field of the SVD. </p>
<p>The nominal value of Bit 0 of the Volume Flags field for a
Unicode SVD shall be ZERO. </p>
<h3><a name="resolution">Resolution of ISO 9660 </a>Ambiguities
for Wide Characters</h3>
<p>This specification resolves ISO 9660 ambiguities with respect
to wide (16-bit) character sets, such as the UCS-2 character set. </p>
<h3><a name="wide">Wide Character Byte Ordering</a> </h3>
<p>All UCS-2 characters shall be recorded according to ISO
9660:1988 section 7.2.2, 16-bit numerical value, most significant
byte first (&quot;Big Endian&quot;). </p>
<h3><a name="allowed">Allowed Character Set</a> </h3>
<p>All UCS-2 code points shall be allowed except for the
following UCS-2 code points: </p>
<ul>
<li>All code points between (00)(00) and (00)(1F), inclusive.
(Control Characters) </li>
    <li>(00)(2A) '*' (Asterisk) </li>
<li>(00)(2F) '/' (Forward Slash) </li>
<li>(00)(3A) ':' (Colon) </li>
<li>(00)(3B) ';' (Semicolon) </li>
<li>(00)(3F) '?' (Question Mark) </li>
<li>(00)(5C) '\' (Backslash) </li>
</ul>
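<p>An illustrative check of a proposed identifier against this excluded set (together with the relaxed 128-byte length limit described later) might look like the Python sketch below. The helper name is hypothetical, and only the name portion is examined; the version number and its separator are recorded separately. </p>

```python
# Excluded UCS-2 code points: controls (00)(00)-(00)(1F), plus * / : ; ? \
EXCLUDED = set("*/:;?\\") | {chr(c) for c in range(0x20)}

def valid_joliet_name(name):
    # 64 UCS-2 characters = 128 bytes, the relaxed Joliet limit.
    if not name or len(name) > 64:
        return False
    return not any(ch in EXCLUDED for ch in name)
```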
<p><a name="identifiers"></a> </p>
<p><a href="#contents">return to the table of contents</a> </p>
<h3>Special Directory Identifiers </h3>
<p>Section 7.6 of ISO 9660 describes the recording of reserved
directory identifiers for the root, current, and parent directory
identifiers as single (00) or single (01) bytes. </p>
<p>In a wide character set, it is not possible to represent a
character in a single byte. The following portions of the ISO
9660:1988 specification referring to reserved directory
identifiers are ambiguous. </p>
<p>The ISO 9660:1988 sections in question are as follows: </p>
<ul>
<li>6.8.2.2 Identification of directories </li>
<li>7.6.2 Reserved Directory Identifiers </li>
<li>9.1.11 File Identifier </li>
<li>9.4.5 Directory Identifier </li>
</ul>
<p>These special case directory identifiers are not intended to
represent characters in a graphic character set. These characters
are placeholders, not characters. Therefore, these definitions
remain unchanged on a volume recorded in Unicode. </p>
<p>Simply put, Special Directory Identifiers shall remain as
8-bit values, even on a UCS-2 volume, where other characters have
been expanded to 16-bits. </p>
<dl>
<dt>Root Directory </dt>
<dt><dfn>The Directory Identifier of a Directory Record
describing the Root Directory shall consist of a single
(00) byte.</dfn> </dt>
<dt>Current Directory </dt>
<dt><dfn>The Directory Identifier of the first Directory
Record of each directory shall consist of a single (00)
byte.</dfn> </dt>
<dt>Parent Directory </dt>
<dt><dfn>The Directory Identifier of the second Directory
Record of each directory shall consist of a single (01)
byte.</dfn> </dt>
</dl>
<h3><a name="separator">Separator Characters</a> </h3>
<p>The separator characters SEPARATOR 1 and SEPARATOR 2 are
specified as 8-bit characters, which can not be represented in a
wide character set, so the ISO 9660:1988 specification sections
referring to SEPARATOR 1 and SEPARATOR 2 are ambiguous. </p>
<p>The ISO 9660:1988 sections in question are as follows: </p>
<ul>
<li>7.4.3 Separators </li>
<li>7.5.1 File Identifier format </li>
<li>7.5.2 File Identifier length </li>
<li>8.4.24 Abstract File Identifier </li>
<li>8.4.25 Bibliographic File Identifier </li>
<li>8.5.17 Copyright File Identifier </li>
<li>8.5.19 Bibliographic File Identifier </li>
<li>9.1.11 File Identifier </li>
</ul>
<p>The values SEPARATOR 1 and SEPARATOR 2 shall be represented
differently depending on the d1 character set. </p>
<p>In the case of an SVD identifying a UCS-2 character set, the
values of SEPARATOR 1 and SEPARATOR 2 shall be recorded as a
UCS-2 character with an equivalent code point value. </p>
<p>Otherwise, the definitions of SEPARATOR 1 and SEPARATOR 2
shall be recorded according to section 7.4.3 of ISO 9660:1988. </p>
<p>Simply put, SEPARATOR 1 and SEPARATOR 2 shall be expanded to
16-bits. </p>
<p> </p>
<hr>
<b>
<p>Table 2 - Separator Representations</b> </p>
<pre>
                ISO 9660:1988 Volume     Unicode Volume
  Separator     Bit Combination          UCS-2 Codepoint
  SEPARATOR 1   (2E)                     (00)(2E)
  SEPARATOR 2   (3B)                     (00)(3B)
</pre>
<hr>
<p><a name="sort"></a><a href="#contents">return to the table of
contents</a> </p>
<h3>Sort Ordering</h3>
<p>ISO 9660 specifies the order of path table records within a
path table, and specifies the order of directory records within a
directory. These sorting algorithms assume an 8-bit character set
is used. These sorting algorithms are ambiguous when used with
wide characters. </p>
<p>The ISO 9660:1988 sections in question are as follows: </p>
<ul>
<li>6.9.1 Order of Path Table Records </li>
<li>9.3 Order of Directory Records </li>
</ul>
<p>The only change required is to redefine the value of the sort
justification pad byte to zero (00). </p>
<p>Simply put, comparing the byte contents in all positions
remains a suitable sorting algorithm for the descriptor fields
recorded in a UCS-2 SVD Directory Hierarchy. This is one of the
primary reasons for selecting the Big Endian format to represent
all UCS-2 characters. </p>
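<p>The effect can be demonstrated directly: sorting names by their padded big-endian UCS-2 byte strings yields exactly the order given by comparing 16-bit code points. An illustrative Python sketch: </p>

```python
def ucs2_be(name, pad_to=0):
    # Big-endian UCS-2 encoding, padded with (00) bytes as Joliet
    # requires for the sort justification pad.
    b = name.encode("utf-16-be")
    return b + b"\x00" * max(0, pad_to - len(b))

names = ["B", "A\u00c9", "AB", "A"]
width = max(len(n) for n in names) * 2
by_bytes = sorted(names, key=lambda n: ucs2_be(n, width))
by_code = sorted(names)   # Python compares strings by code point
# Both orderings agree, because the most significant byte is recorded
# first and the (00) pad byte sorts before every allowed character.
```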
<p><b>Natural Language Sorting</b> </p>
<p>On a Unicode volume, the 16-bit UCS-2 code points are used to
determine the Order of Path Table Records and the Order of
Directory Records. </p>
<p>No attempt will be made to provide natural language sorting on
the media. Natural language sorting may optionally be provided by
a display application as desired. </p>
<p><b>Justification Pad Bytes</b> </p>
<p>The sort ordering algorithms as specified in ISO 9660:1988
sections 6.9.1 and 9.3 are acceptable except for the value of the
justification &quot;pad byte&quot;. </p>
<p>The value of the justification &quot;pad byte&quot; as
specified in ISO 9660:1988 section 6.9.1 shall be (00). This is
changed from a value of (20) as specified in that same section. </p>
<p>The value of the justification &quot;pad byte&quot; as
specified in ISO 9660:1988 section 9.3 subsections (a) and (b)
shall be (00). This is changed from a value of (20) as specified
in those same sections. </p>
<p>The value of the justification &quot;pad byte&quot; as
specified in ISO 9660:1988 section 9.3 subsections (c) shall be
(00). This is changed from a value of (30) as specified in that
same section. </p>
<p>Simply put, set all the justification &quot;pad bytes&quot; to
zero to simplify sorting. </p>
<p> <b>Mandatory Sort Ordering.</b> </p>
<p>Correct sort ordering is mandatory on UCS-2 volumes. </p>
<p><b>Descriptor Fields affected by the UCS-2 Escape Sequence</b> </p>
<p>If a UCS-2 escape sequence is detected in a supplementary
volume descriptor, the following descriptor fields referenced
from that supplementary volume descriptor shall contain UCS-2
characters. </p>
<ul>
<li>ISO 9660:1988 Section 8.5.4 System Identifier </li>
<li>ISO 9660:1988 Section 8.5.5 Volume Identifier </li>
<li>ISO 9660:1988 Section 8.5.13 Volume Set Identifier </li>
<li>ISO 9660:1988 Section 8.5.14 Publisher Identifier </li>
<li>ISO 9660:1988 Section 8.5.15 Data Preparer Identifier </li>
<li>ISO 9660:1988 Section 8.5.16 Application Identifier </li>
<li>ISO 9660:1988 Section 8.5.17 Copyright File Identifier </li>
<li>ISO 9660:1988 Section 8.5.18 Abstract File Identifier
(missing section) </li>
<li>ISO 9660:1988 Section 8.5.19 Bibliographic File
Identifier </li>
<li>ISO 9660:1988 Section 9.1.11 File Identifier </li>
<li>ISO 9660:1988 Section 9.4.5 Directory Identifier </li>
<li>ISO 9660:1988 Section 9.5.11 System Identifier (of
Extended Attribute Record) </li>
</ul>
<p><a name="relaxation"></a><a href="#contents">return to the
table of contents</a> </p>
<h3>Relaxation of ISO 9660 Restrictions on UCS-2 Volumes </h3>
<p>Several ISO 9660 restrictions will be relaxed to achieve a
more useful recording specification. Joliet receiving systems
shall be capable of receiving media recorded with restrictions
which have been relaxed relative to ISO 9660. </p>
<p> <b>Maximum File Identifier Length Increased</b> </p>
<p>Joliet receiving systems shall receive directory hierarchies
recorded with file identifiers longer than those allowed by ISO
9660 receiving systems. </p>
<p>ISO 9660 (Section 7.5.1) states that the sum of the following
shall not exceed 30: </p>
<ul>
<li>If there is a file name, the length of the file name. </li>
<li>If there is a file name extension, the length of the file
name extension. </li>
</ul>
<p>On Joliet compliant media, however, the sum as calculated
above shall not exceed 128, to allow for longer file identifiers. </p>
<p>The above lengths shall be expressed as a number of bytes. </p>
<p><b>Maximum Directory Identifier Length Increased</b> </p>
<p>Joliet receiving systems shall receive directory hierarchies
recorded with file names longer than those allowed by ISO 9660
receiving systems. </p>
<p>ISO 9660 (Section 7.6.3) states that the length of a directory
identifier shall not exceed 31. </p>
<p>On Joliet compliant media, however, the length of a directory
identifier shall not exceed 128, to allow for longer directory
identifiers. </p>
<p>The above lengths shall be expressed as a number of bytes. </p>
<p> <b>Directory Names May Have File Name Extensions</b> </p>
<p>ISO 9660 does not allow directory identifiers to contain file
name extensions. </p>
<p>On Joliet compliant media, however, directory identifiers may
contain file name extensions. </p>
<p>The Joliet directory identifier format shall be calculated
according to ISO 9660 section 7.5.1 &quot;File Identifier
format&quot;, with the exception that the length of a directory
identifier may exceed 31, but shall not exceed 128. </p>
<p>In addition, the Joliet directory identifier format shall
comply with ISO 9660 section 7.6.2 &quot;Reserved Directory
Identifiers&quot;. </p>
<p>The directory identifier length shall be calculated according
to ISO 9660 section 7.5.2 &quot;File Identifier length&quot;. </p>
<p>The above lengths shall be expressed as a number of bytes. </p>
<p><b>Maximum Directory Hierarchy Depth May Exceed 8 Levels</b> </p>
<p>ISO 9660 (Section 6.8.2.1) specifies restrictions regarding
the Depth of Directory Hierarchy. This section of ISO 9660
specifies that this number of levels in the hierarchy shall not
exceed eight. </p>
<p>On Joliet compliant media, however, the number of levels in
the hierarchy may exceed eight. </p>
<p>Joliet compliant media shall comply with the remainder of ISO
9660 section 6.8.2.1, so that for each file recorded, the sum of
the following shall not exceed 240: </p>
<ul>
<li>the length of the file identifier; </li>
<li>the length of the directory identifiers of all relevant
directories; </li>
<li>the number of relevant directories. </li>
</ul>
<p>The above lengths shall be expressed as a number of bytes. </p>
<p><a name="extension"></a><a href="#contents">return to the
table of contents</a> </p>
<h2>Extensions to Joliet </h2>
<h3><a name="multisession"></a>Joliet for Multisession Media </h3>
<p><b>Multisession Recordings are Received</b> </p>
<p>When provided with CD-ROM reader hardware with multisession
capability, Joliet receiving systems shall receive media recorded
using the multisession recording technique. </p>
<p>The details of this technique are provided below. </p>
<p><b>Logical Sector Addressing on Multisession Recordings</b> </p>
<p>Each sector on the media is assigned a unique Logical Sector
Address. </p>
<p>Logical Sector Addresses zero and above increase linearly
across the surface of the disc, regardless of session boundaries. </p>
<p>Logical Sector Address zero references the sector with
Minute:Second:Frame address 00:02:00 in the first session. All
other Logical Sector Addresses are relative to
Minute:Second:Frame address 00:02:00 in the first session. </p>
<p>The conversion between Logical Sector Addresses and
Minute:Second:Frame addresses is Logical Sector Address =
(((Minute*60)+Seconds)*75) - 150. </p>
<p>Simply put, the Logical Sector Address on a multisession disc
describes a flat address space. </p>
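<p>Extended with the frame term (each frame holds one sector, and there are 75 frames per second), the conversion can be sketched as follows; the function names are illustrative: </p>

```python
def msf_to_lsn(minute, second, frame):
    # 00:02:00 in the first session is Logical Sector Address 0,
    # hence the fixed offset of 150 frames (2 seconds).
    return (minute * 60 + second) * 75 + frame - 150

def lsn_to_msf(lsn):
    # Inverse conversion, back to a Minute:Second:Frame address.
    minute, rest = divmod(lsn + 150, 60 * 75)
    second, frame = divmod(rest, 75)
    return minute, second, frame
```

For the three-session example below, the Volume Recognition Sequence starting at 20:00:16 sits at Logical Sector Address 89866.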
<p> <b>Multisession Addressability</b> </p>
<p>The data area for a volume may span multiple sessions. </p>
<p>For example, if a disc is recorded with 3 sessions, the
directory hierarchy described by a volume descriptor in session 3
may reference logical sectors recorded in session 1, 2, or 3. </p>
<p><b>Multisession Volume Recognition Sequence</b> </p>
<p>The Volume Recognition Sequence shall begin at the 16th
logical sector of the first track of the last session on the
disc. </p>
<p>This volume recognition sequence supersedes all other volume
recognition sequences on the disc. The interpretation of the
Volume Recognition Sequence is otherwise unchanged. </p>
<p>For example, consider a disc that contains 3 sessions, where
session 1 starts at 00:00:00, session 2 starts at 10:00:00, and
session 3 starts at 20:00:00. The Volume Recognition Sequence for
this disc would start at Minute:Second:Frame address 20:00:16. </p>
<p>This technique is compatible with the CD-Bridge multisession
technique. </p>
<p><b>Track Modes and Sector Forms</b> </p>
<p>The data area for a Joliet volume on a CD-ROM shall be
comprised of either Mode 1 or Mode 2 Form 1 sectors. CD-ROM media
utilizing the multisession recording techniques outlined above
may not contain any Mode 1 sectors anywhere on the media. Mode 1
sectors are allowed only on single-session media. </p>
<p>Mode 2 Form 2 sectors and CD-Digital Audio tracks may be
recorded on the same media as a Joliet volume. In this case, the
CD-XA extensions to Joliet may be utilized to identify Mode 2
Form 2 extents and CD-Digital Audio extents. </p>
<p>CD-Digital Audio tracks may not be recorded in sessions 2 and
higher. If any CD-Digital Audio tracks are recorded, all the
CD-Digital Audio tracks shall be recorded in the first session. </p>
<h3><a name="_Toc305607052"></a><a name="cdxa"></a>CD-XA
Extensions to Joliet </h3>
<p>CD-ROM discs utilizing the Joliet extensions to ISO 9660 and
which also identify mode 2 form 2 extents or CD-Digital Audio
extents shall be marked with a CD-ROM XA Label as specified in
&quot;System Description CD-XA&quot; section 2.1. </p>
<p>The CD-ROM XA Label shall be located at offset 1024 (byte
position 1025) in the Joliet Supplementary Volume Descriptor. The
identifying signature 'CD-XA001' shall be recorded starting at
offset 1024 in the Joliet Supplementary Volume Descriptor. This
identifying signature is equivalent to the hex bytes
(43)(44)(2D)(58)(41)(30)(30)(31). </p>
<p>Mode 2 form 2 extents shall be identified using recording
rules outlined in &quot;System Description CD-XA&quot;, section
2.7. In this case, bit 12 of the Attributes field of the &quot;XA
System Use Information&quot; shall be set to one to identify that
the file contains mode 2 form 2 sectors. See below for additional
information regarding Data Length. </p>
<p>CD-Digital Audio extents shall be identified using recording
rules outlined in &quot;System Description CD-XA&quot;, section
2.7. In this case, bit 14 of the Attributes field of the &quot;XA
System Use Information&quot; shall be set to one to identify that
the file is comprised of an extent of CD-Digital Audio. See below
for additional information regarding Data Length. </p>
<p>If a file is marked such that either bit 12 is set to one or
bit 14 is set to one in the Attributes field of the &quot;XA
System Use Information&quot;, then the Data Length field of the
Directory Record shall be set to 2048 times the number of sectors
contained in the extent. </p>
<p>See ISO 9660:1988 section 9.1.4. </p>
<h3><a name="_Toc305607053"></a><a name="other"></a>Other
Extensions to Joliet </h3>
<p>The Joliet Extensions to ISO 9660 are designed to coexist with
other extensions such as the &quot;System Use Sharing
Protocol&quot; and &quot;RockRidge Interchange Protocol&quot;.
However, these protocols are not an integral part of the Joliet
specification. </p>
<p>The method used to integrate these other protocols into Joliet
is not defined here. </p>
<p><a name="bibliography"></a><a href="#contents">return to the
table of contents</a> </p>
<h2>Bibliography </h2>
<p><u>ISO 2022 - <i>Information processing </i>- ISO 7-bit and
8-bit coded character sets - Code extension techniques</u>,
International Organization for Standardization, </p>
<p><u>ISO 9660 - <i>Information processing </i>- Volume and file
structure of CD-ROM for information interchange</u>,
International Organization for Standardization, 1988-04-15 </p>
<p><u>ISO 10149 : 1989 (E) - <i>Information technology</i> - Data
interchange on read-only 120mm optical data discs (CD-ROM)
&quot;YellowBook&quot;, </u>International Organization for
Standardization, 1989-09-01 </p>
<p><u>ISO 10646 - Information technology - Universal
Multiple-Octet Coded Character Sets (UCS)</u>, International
Organization for Standardization, </p>
<p><u>The Unicode Standard - <i>Worldwide Character Encoding </i>Version
1.0,</u> The Unicode Consortium, Addison-Wesley Publishing
Company, Inc, 1990-1991 Unicode, Inc., Volume 1 </p>
<p><u>Orangebook</u>, N. V. Philips and Sony Corporation,
November 1990 </p>
<p><u>System Description CD-XA, </u>N. V. Philips and Sony
Corporation, March 1991 </p>
<p><u>System Use Sharing Protocol</u> </p>
<p><u>RockRidge Interchange Protocol</u> </p>
<p>
<hr>
<p><b>Copyright &#169; 1995 Microsoft Corporation unless
otherwise specified. All Rights Reserved.<br>
</b> </center> </p>
</body>
</html>


@@ -0,0 +1,140 @@
Date: Sat, 29 Jun 1996 18:59:41 -0500
From: Robert Vandervelde <RVand@SNOWHILL.COM>
Subject: Re: Long Filename Structure (Windows 95).
>Does anybody know how Windows 95 implemented long filenames?
>I used Norton Disk Editor to read the directory and found some encrypted
>form of the long filenames.
>I plan to write a utility that changes the long filenames only.
>
>Thank you,
>;-b Sintar Wirawan at Menur 30 Surabaya - Indonesia
>8-@ squid@sby.mega.net.id
>
>
How Windows 95 Stores Long Filenames
Copyright notice: Taken from PC Magazine, June 25, 1996 by Jeff Prosise
"Windows 95 stores short filenames the same way DOS and 16-bit Windows do.
Every file on every disk is accompanied by a 32-byte directory entry that
records the name of the file as well as the file's attributes, a date and
time stamp, and other information."
The format of the short directory entry is as follows:
Offset Description Size
0 Filename 8 bytes (ASCII)
8 Filename extension 3 bytes (ASCII)
11 File attributes 1 byte (encoded)
12 reserved 10 bytes
22 Time stamp 2 bytes (encoded)
24 Date stamp 2 bytes (encoded)
26 Starting cluster 2 bytes
28 File size 4 bytes
File attributes byte
7: reserved 3: Volume label
6: reserved 2: System
5: archive 1: Hidden
4: subdirectory 0: Read-only
Time stamp word (16 bits)
11-15: Hours (0-23)
5-10: Minutes (0-59)
0- 4: Seconds divided by 2 (0-29)
Date stamp word (16 bits)
9-15: Year (relative to 1980)
5- 8: Month (1-12)
0- 4: Day of month (1-31)
"Because of compatibility issues, adding long filename support to an
operating system that uses 8.3 filenames isn't as simple as expanding
directory entries to hold more than 11 characters. ...
Windows 95's designers devised a clever solution to the problem of
supporting long filenames while preserving compatibility with previous
versions of DOS and Windows applications. ... Through testing, Microsoft
found that if a directory entry is marked with an "impossible" combination
of read-only, hidden, system, and volume label attribute bits - that is,
if the directory entry's attribute byte holds the value 0Fh - the
enumeration functions built into all existing versions of DOS and pre-95
versions of Windows will skip over that directory entry as if it weren't
there.
The solution for Windows 95, then, was to store two names for every file and
subdirectory: a short name that's visible to all applications and a long
name that's visible only to Windows 95 applications...Short filenames are
stored in 8.3 format in conventional 32-byte directory entries. Windows
creates a short filename from a long one by truncating it to six uppercase
characters and adding "~1" to the end of the base filename. If there's
already another filename with the same first six characters, the number is
incremented. The extension is kept the same, and any character that was
illegal in earlier versions of Windows and DOS is replaced with an
underscore.
Long filenames are stored in specially formatted 32-byte long filename (LFN)
directory entries marked with attribute bytes set to 0Fh. For a given
file or subdirectory, a group of one or more LFN directory entries
immediately precedes the single 8.3 directory entry on the disk. Each LFN
directory entry contains up to 13 characters of the long filename, and the
OS strings together as many as needed to comprise an entire long filename.
Filenames are stored in Unicode format, which requires 2 bytes per character
as opposed to ASCII's 1 byte. Filename characters are spread among three
separate fields: the first 10 bytes (five characters) in length, the second
12 bytes (6 characters), and the third 4 bytes (two characters). The lowest
five bits of the directory entry's first byte hold a sequence number that
identifies the directory entry's position relative to other LFN directory
entries associated with the same file. If a long filename requires three
LFN directory entries, for example, the sequence number of the first will
be 1, that of the second will be 2, and the sequence of the third will be
3. Bit 6 of the third entry's first byte is set to 1 to indicate that it's
the last entry in the sequence.
The attribute field appears at the same location in LFN directory entries
as in 8.3 directory entries. ... The starting cluster number field also
appears at the same location, but in LFN directory entries its value is
always 0. The type indicator field also holds 0 in every long filename I've
examined, but Adrian King's Inside Windows 95 (Microsoft Press, 1994) says
it can also hold a nonzero value indicating that the directory entry
contains "class information" for the corresponding file. ... The LFN
directory entry's checksum byte holds an eight-bit checksum value computed
by adding certain fields of the 8.3 directory entry and performing a
modulo 256 operation on the result. Windows 95 uses this checksum to detect
orphaned or corrupted LFN directory entries.
Long filename directory entry
OFFSET DESCRIPTION Size
0 Sequence byte 1 byte
1 First five characters of LFN 10 bytes
11 File attributes 1 byte
12 Type indicator 1 byte (always 0??)
13 Checksum 1 byte
14 Next six characters of LFN 12 bytes
26 Starting cluster number 2 bytes (always 0)
28 Next two characters of LFN 4 bytes
*NOTE: The above structure may span up to 31 entries. The last entry in
the group is a standard 8.3 filename directory structure.
Sequence byte
7: apparently unused (always 0)
6: 1=final component of this LFN
5: apparently unused (always 0)
0-4: sequence number (1-31)
--------------------------------------------------------
Robert Vandervelde + ...that what we have learned and
RVand@snowhill.com + truly understood, we discovered
Enterprise, AL + ourselves.
The Wiregrass + - Richard C. Dorf
--------------------------------------------------------


@@ -0,0 +1,41 @@
From: noesis@ucscb.UCSC.EDU (Kyle Anthony York)
Newsgroups: comp.os.msdos.programmer
Subject: Re: Win95 FAT long file name storage?
ok, here goes:
long file names are stored as follows, i've been using direct sector
access, so i don't know if findfirst(..) findnext(..) will work.
the long names are stored as unicode strings in the immediately preceding
entries. the entry attribute byte is 0x0f.
the long name format is also used whenever a filename has lower case
characters, thus preserving case and backwards compatibility.
so...if the name is ``abcdefghijklmnop'' and this is the first entry of a
subdirectory:
entry[0] = "."
entry[1] = ".."
entry[2] = "ijklmnop", attribute = 0x0f
entry[3] = "abcdefgh", attribute = 0x0f
entry[4] = "ABCDEF~1", attribute = normal
in addition to having the attribute 0x0f, the entry format is:
BYTE 0 --- bit 7 = 1 if deleted, 0 if not
6 = 1 if last block of extended entry, 0 if not
5..0 = extended entry # (1..31)
BYTE 1..10 --- first 5 characters in unicode ("abcde" becomes
"a", 0, "b", 0, "c", 0, "d", 0, "e", 0)
BYTE 11 --- attribute (0x0f)
BYTE 12 --- ?? unknown ??
BYTE 13 --- ?? unknown ??
BYTE 14..25 -- next 6 characters (in unicode)
BYTE 26..27 --- 0x0000 (first cluster #)
BYTE 28..31 --- last 4 characters (in unicode)
unused bytes are set to 0xff
end of string is denoted by 0x00, 0x00
best o' luck
--kyle


@@ -0,0 +1,107 @@
LONG FILENAMES
How does Windows 95 store LONG FILENAMES?
This file was worked out by Jozsef Hidasi
Hidasi.Jozsef@MTTBBS.hu
-- [ Contact Info > ] --------------------------------------------------------
If you realize any mistakes, please contact me and let me know, to correct it!
Thanks for everybody who helps to make this dox up to date!
Don't hesitate to contact me!
Jozsef Hidasi
E-Mail: Hidasi.Jozsef@MTTBBS.hu
FIDO: 2:371/4.13 (At the moment this is my BBS :-) You can write to SysOp?!
-- [ WARNING! > ] ------------------------------------------------------------
This text contains the most info I know at the moment! I'm not responsible for
any DATA LOSS!
"???" means I don't know what that field means...
-- [ What is this document about? > ] -----------------------------------------
This document contains some info on how Windows 95 stores long filenames.
I don't know how long a filename can be under Windows, but as I calculated
a file entry can be 832 bytes long. (See below)
Windows uses a simple method to hide a file from DOS: it changes the "file"'s
attribute to VolumeLabel. Basically a disk can have only one VolumeLabel, and
this attribute combination is not used by normal files! In this way we can tell
the difference between a DOS File Rec. (I won't describe it now) and a Windows
Record. Both the DOS File Record and the Windows Record are 32 bytes long.
(The DOS File Record is the main file descriptor: date/time/attrib/etc...)
Windows Record>
OFFSET Count Type Description Remark
------------------------------------------------------------------
0000h 1 byte Counter -
0001h 10 char FileName E1 Entry 1
000Bh 1 byte Attrib Always 0Fh
000Ch 2 word ??? 0
000Eh 12 char FileName E2 Entry 2
001Ah 2 word ??? 0
001Ch 4 char FileName E3 Entry 3
Counter:
If attrib=0Fh then this is a Windows Entry; if the Counter > 40h (bit 6 set),
this is the last one, and lower-numbered Windows Entries will follow:
Entry no.: Counter-'@' (i.e. Counter minus 40h)
Filename: The FileName is cut in 3 parts... Because of DOS compatibility...
???: Reserved or don't know...
Simple Example:
Sector 19 ; Don't laugh! This is a simple floppy :-)
This is a simple DOS filenamed file>
00000000: 53 49 4D 50 4C 45 20 20 - 44 4F 53 20 00 03 B8 9D SIMPLE  DOS ....
00000010: 1F 25 1F 25 00 00 B9 9D - 1F 25 00 00 00 00 00 00 .%.%.....%......
This is the first entry of the new Long filenamed file>
(I created this first; it was then renamed by Windows)
(This file is erased because the filename's first byte is 0E5h)
00000020: E5 49 4D 50 4C 45 20 20 - 57 49 4E 20 00 2A C6 9D .IMPLE  WIN .*..
00000030: 1F 25 1F 25 00 00 C7 9D - 1F 25 00 00 00 00 00 00 .%.%.....%......
This is the first windows entry.
The Filename's first byte (Counter) is 'C' (43h), so this is entry 3 of 3
and further entries will follow ...
(One entry can hold 13 characters of the Long Filename...)
This entry holds "m e d F i l e "=Filename E1+Filename E2+Filename E3
(See below)
This means that the first entry holds the last characters of the long
filename...
00000040: 43 6D 00 65 00 64 00 20 - 00 46 00 0F 00 44 69 00 Cm.e.d. .F...Di.
00000050: 6C 00 65 00 00 00 FF FF - FF FF 00 00 FF FF FF FF l.e.............
Here is the next entry>
Counter=2 means this is the 2nd entry of 3...
Holds: " a L o n g F i l e n a"
00000060: 02 20 00 61 00 20 00 4C - 00 6F 00 0F 00 44 6E 00 . .a. .L.o...Dn.
00000070: 64 00 46 00 69 00 6C 00 - 65 00 00 00 6E 00 61 00 g.F.i.l.e...n.a.
Here is the next entry>
Counter=1 means this is the 1st entry of 3...
Holds: "S i m p l e . w i n i s "
00000080: 01 53 00 69 00 6D 00 70 - 00 6C 00 0F 00 44 65 00 .S.i.m.p.l...De.
00000090: 2E 00 77 00 69 00 6E 00 - 20 00 00 00 69 00 73 00 ..w.i.n. ...i.s.
Here is the last entry>
This is the standard DOS 8.3 entry, not a Windows Entry...
This entry holds the file's general info like date/time/attrib/length/etc...
and the DOS filename...
000000A0: 53 49 4D 50 4C 45 7E 31 - 57 49 4E 20 00 2A C6 9D SIMPLE~1WIN .*..
000000B0: 1F 25 1F 25 00 00 C7 9D - 1F 25 00 00 00 00 00 00 .%.%.....%......
--------------------------
Summary>
DOS Filename: simple~1.win
Long Filename: Simple.win is a LongFilenamed file...
* Each character has a 0 byte after it because the long name is stored as
  16-bit Unicode!
The followings are empty:
000000C0: 00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00 ................
000000D0: 00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00 ................
Well, that's all I can write now, but it's quite hard to explain how this
works! Write me a letter instead... :-)
-- [ End of document > ] -----------------------------------------------------
Best wishes,
Hidi...



@@ -0,0 +1,392 @@
NOTES ON THE STRUCTURE OF THE VFAT FILESYSTEM
----------------------------------------------------------------------
(This documentation was provided by Galen C. Hunt <gchunt@cs.rochester.edu>
and lightly annotated by Gordon Chaffee).
This document presents a very rough, technical overview of my
knowledge of the extended FAT file system used in Windows NT 3.5 and
Windows 95. I don't guarantee that any of the following is correct,
but it appears to be so.
The extended FAT file system is almost identical to the FAT
file system used in DOS versions up to and including 6.223410239847
:-). The significant change has been the addition of long file names.
These names support up to 255 characters including spaces and lower
case characters as opposed to the traditional 8.3 short names.
Here is the description of the traditional FAT entry in the current
Windows 95 filesystem:
struct directory { // Short 8.3 names
unsigned char name[8]; // file name
unsigned char ext[3]; // file extension
unsigned char attr; // attribute byte
unsigned char lcase; // Case for base and extension
unsigned char ctime_ms; // Creation time, milliseconds
unsigned char ctime[2]; // Creation time
unsigned char cdate[2]; // Creation date
unsigned char adate[2]; // Last access date
unsigned char reserved[2]; // reserved values (ignored)
unsigned char time[2]; // time stamp
unsigned char date[2]; // date stamp
unsigned char start[2]; // starting cluster number
unsigned char size[4]; // size of the file
};
The lcase field specifies if the base and/or the extension of an 8.3
name should be capitalized. This field does not seem to be used by
Windows 95 but it is used by Windows NT. The case of filenames is not
completely compatible from Windows NT to Windows 95, nor is it completely
compatible in the reverse direction. Filenames that fit in
the 8.3 namespace and are written on Windows NT to be lowercase will
show up as uppercase on Windows 95.
Note that the "start" and "size" values are actually little
endian integer values. The descriptions of the fields in this
structure are public knowledge and can be found elsewhere.
With the extended FAT system, Microsoft has inserted extra
directory entries for any files with extended names. (Any name which
legally fits within the old 8.3 encoding scheme does not have extra
entries.) I call these extra entries slots. Basically, a slot is a
specially formatted directory entry which holds up to 13 characters of
a files extended name. Think of slots as additional labeling for the
directory entry of the file to which they correspond. Microsoft
prefers to refer to the 8.3 entry for a file as its alias and the
extended slot directory entries as the file name.
The C structure for a slot directory entry follows:
struct slot { // Up to 13 characters of a long name
unsigned char id; // sequence number for slot
unsigned char name0_4[10]; // first 5 characters in name
unsigned char attr; // attribute byte
unsigned char reserved; // always 0
unsigned char alias_checksum; // checksum for 8.3 alias
unsigned char name5_10[12]; // 6 more characters in name
unsigned char start[2]; // starting cluster number
unsigned char name11_12[4]; // last 2 characters in name
};
If the layout of the slots looks a little odd, it's only
because of Microsoft's efforts to maintain compatibility with old
software. The slots must be disguised to prevent old software from
panicking. To this end, a number of measures are taken:
1) The attribute byte for a slot directory entry is always set
to 0x0f. This corresponds to an old directory entry with
attributes of "hidden", "system", "read-only", and "volume
label". Most old software will ignore any directory
entries with the "volume label" bit set. Real volume label
entries don't have the other three bits set.
2) The starting cluster is always set to 0, an impossible
value for a DOS file.
Because the extended FAT system is backward compatible, it is
possible for old software to modify directory entries. Measures must
be taken to ensure the validity of slots. An extended FAT system can
verify that a slot does in fact belong to an 8.3 directory entry by
the following:
1) Positioning. Slots for a file always immediately precede
their corresponding 8.3 directory entry. In addition, each
slot has an id which marks its order in the extended file
name. Here is a very abbreviated view of an 8.3 directory
entry and its corresponding long name slots for the file
"My Big File.Extension which is long":
<preceding files...>
<slot #3, id = 0x43, characters = "h is long">
<slot #2, id = 0x02, characters = "xtension whic">
<slot #1, id = 0x01, characters = "My Big File.E">
<directory entry, name = "MYBIGFIL.EXT">
Note that the slots are stored from last to first. Slots
are numbered from 1 to N. The Nth slot is or'ed with 0x40
to mark it as the last one.
2) Checksum. Each slot has an "alias_checksum" value. The
checksum is calculated from the 8.3 name using the
following algorithm:
for (sum = i = 0; i < 11; i++) {
sum = (((sum&1)<<7)|((sum&0xfe)>>1)) + name[i];
}
3) If there is room in the final slot, a Unicode NULL (0x0000) is stored
after the final character. After that, all unused characters in
the final slot are set to Unicode 0xFFFF.
Finally, note that the extended name is stored in Unicode. Each Unicode
character takes two bytes.
NOTES ON UNICODE TRANSLATION IN VFAT FILESYSTEM
----------------------------------------------------------------------
(Information provided by Steve Searle <steve@mgu.bath.ac.uk>)
Char used as Char(s) used Char(s) used in Entries which have
filename in shortname longname slot been corrected
0x80 (128) 0x80 0xC7
0x81 (129) 0x9A 0xFC
0x82 (130) 0x90 0xE9 E
0x83 (131) 0xB6 0xE2 E
0x84 (132) 0x8E 0xE4 E
0x85 (133) 0xB7 0xE0 E
0x86 (134) 0x8F 0xE5 E
0x87 (135) 0x80 0xE7 E
0x88 (136) 0xD2 0xEA E
0x89 (137) 0xD3 0xEB E
0x8A (138) 0xD4 0xE8 E
0x8B (139) 0xD8 0xEF E
0x8C (140) 0xD7 0xEE E
0x8D (141) 0xDE 0xEC E
0x8E (142) 0x8E 0xC4 E
0x8F (143) 0x8F 0xC5 E
0x90 (144) 0x90 0xC9 E
0x91 (145) 0x92 0xE6 E
0x92 (146) 0x92 0xC6 E
0x93 (147) 0xE2 0xF4 E
0x94 (148) 0x99 0xF6
0x95 (149) 0xE3 0xF2
0x96 (150) 0xEA 0xFB
0x97 (151) 0xEB 0xF9
0x98 (152) "_~1" 0xFF
0x99 (153) 0x99 0xD6
0x9A (154) 0x9A 0xDC
0x9B (155) 0x9D 0xF8
0x9C (156) 0x9C 0xA3
0x9D (157) 0x9D 0xD8
0x9E (158) 0x9E 0xD7
0x9F (159) 0x9F 0x92
0xA0 (160) 0xB5 0xE1
0xA1 (161) 0xD6 0xE0
0xA2 (162) 0xE0 0xF3
0xA3 (163) 0xE9 0xFA
0xA4 (164) 0xA5 0xF1
0xA5 (165) 0xA5 0xD1
0xA6 (166) 0xA6 0xAA
0xA7 (167) 0xA7 0xBA
0xA8 (168) 0xA8 0xBF
0xA9 (169) 0xA9 0xAE
0xAA (170) 0xAA 0xAC
0xAB (171) 0xAB 0xBD
0xAC (172) 0xAC 0xBC
0xAD (173) 0xAD 0xA1
0xAE (174) 0xAE 0xAB
0xAF (175) 0xAF 0xBB
0xB0 (176) 0xB0 0x91 0x25
0xB1 (177) 0xB1 0x92 0x25
0xB2 (178) 0xB2 0x93 0x25
0xB3 (179) 0xB3 0x02 0x25
0xB4 (180) 0xB4 0x24 0x25
0xB5 (181) 0xB5 0xC1
0xB6 (182) 0xB6 0xC2
0xB7 (183) 0xB7 0xC0
0xB8 (184) 0xB8 0xA9
0xB9 (185) 0xB9 0x63 0x25
0xBA (186) 0xBA 0x51 0x25
0xBB (187) 0xBB 0x57 0x25
0xBC (188) 0xBC 0x5D 0x25
0xBD (189) 0xBD 0xA2
0xBE (190) 0xBE 0xA5
0xBF (191) 0xBF 0x10 0x25
0xC0 (192) 0xC0 0x14 0x25
0xC1 (193) 0xC1 0x34 0x25
0xC2 (194) 0xC2 0x2C 0x25
0xC3 (195) 0xC3 0x1C 0x25
0xC4 (196) 0xC4 0x00 0x25
0xC5 (197) 0xC5 0x3C 0x25
0xC6 (198) 0xC7 0xE3 E
0xC7 (199) 0xC7 0xC3
0xC8 (200) 0xC8 0x5A 0x25 E
0xC9 (201) 0xC9 0x54 0x25 E
0xCA (202) 0xCA 0x69 0x25 E
0xCB (203) 0xCB 0x66 0x25 E
0xCC (204) 0xCC 0x60 0x25 E
0xCD (205) 0xCD 0x50 0x25 E
0xCE (206) 0xCE 0x6C 0x25 E
0xCF (207) 0xCF 0xA4 E
0xD0 (208) 0xD1 0xF0
0xD1 (209) 0xD1 0xD0
0xD2 (210) 0xD2 0xCA
0xD3 (211) 0xD3 0xCB
0xD4 (212) 0xD4 0xC8
0xD5 (213) 0x49 0x31 0x01
0xD6 (214) 0xD6 0xCD
0xD7 (215) 0xD7 0xCE
0xD8 (216) 0xD8 0xCF
0xD9 (217) 0xD9 0x18 0x25
0xDA (218) 0xDA 0x0C 0x25
0xDB (219) 0xDB 0x88 0x25
0xDC (220) 0xDC 0x84 0x25
0xDD (221) 0xDD 0xA6
0xDE (222) 0xDE 0xCC
0xDF (223) 0xDF 0x80 0x25
0xE0 (224) 0xE0 0xD3
0xE1 (225) 0xE1 0xDF
0xE2 (226) 0xE2 0xD4
0xE3 (227) 0xE3 0xD2
0xE4 (228) 0x05 0xF5
0xE5 (229) 0x05 0xD5
0xE6 (230) 0xE6 0xB5
0xE7 (231) 0xE8 0xFE
0xE8 (232) 0xE8 0xDE
0xE9 (233) 0xE9 0xDA
0xEA (234) 0xEA 0xDB
0xEB (235) 0xEB 0xD9
0xEC (236) 0xED 0xFD
0xED (237) 0xED 0xDD
0xEE (238) 0xEE 0xAF
0xEF (239) 0xEF 0xB4
0xF0 (240) 0xF0 0xAD
0xF1 (241) 0xF1 0xB1
0xF2 (242) 0xF2 0x17 0x20
0xF3 (243) 0xF3 0xBE
0xF4 (244) 0xF4 0xB6
0xF5 (245) 0xF5 0xA7
0xF6 (246) 0xF6 0xF7
0xF7 (247) 0xF7 0xB8
0xF8 (248) 0xF8 0xB0
0xF9 (249) 0xF9 0xA8
0xFA (250) 0xFA 0xB7
0xFB (251) 0xFB 0xB9
0xFC (252) 0xFC 0xB3
0xFD (253) 0xFD 0xB2
0xFE (254) 0xFE 0xA0 0x25
0xFF (255) 0xFF 0xA0
Page 0
0x80 (128) 0x00
0x81 (129) 0x00
0x82 (130) 0x00
0x83 (131) 0x00
0x84 (132) 0x00
0x85 (133) 0x00
0x86 (134) 0x00
0x87 (135) 0x00
0x88 (136) 0x00
0x89 (137) 0x00
0x8A (138) 0x00
0x8B (139) 0x00
0x8C (140) 0x00
0x8D (141) 0x00
0x8E (142) 0x00
0x8F (143) 0x00
0x90 (144) 0x00
0x91 (145) 0x00
0x92 (146) 0x00
0x93 (147) 0x00
0x94 (148) 0x00
0x95 (149) 0x00
0x96 (150) 0x00
0x97 (151) 0x00
0x98 (152) 0x00
0x99 (153) 0x00
0x9A (154) 0x00
0x9B (155) 0x00
0x9C (156) 0x00
0x9D (157) 0x00
0x9E (158) 0x00
0x9F (159) 0x92
0xA0 (160) 0xFF
0xA1 (161) 0xAD
0xA2 (162) 0xBD
0xA3 (163) 0x9C
0xA4 (164) 0xCF
0xA5 (165) 0xBE
0xA6 (166) 0xDD
0xA7 (167) 0xF5
0xA8 (168) 0xF9
0xA9 (169) 0xB8
0xAA (170) 0x00
0xAB (171) 0xAE
0xAC (172) 0xAA
0xAD (173) 0xF0
0xAE (174) 0x00
0xAF (175) 0xEE
0xB0 (176) 0xF8
0xB1 (177) 0xF1
0xB2 (178) 0xFD
0xB3 (179) 0xFC
0xB4 (180) 0xEF
0xB5 (181) 0xE6
0xB6 (182) 0xF4
0xB7 (183) 0xFA
0xB8 (184) 0xF7
0xB9 (185) 0xFB
0xBA (186) 0x00
0xBB (187) 0xAF
0xBC (188) 0xAC
0xBD (189) 0xAB
0xBE (190) 0xF3
0xBF (191) 0x00
0xC0 (192) 0xB7
0xC1 (193) 0xB5
0xC2 (194) 0xB6
0xC3 (195) 0xC7
0xC4 (196) 0x8E
0xC5 (197) 0x8F
0xC6 (198) 0x92
0xC7 (199) 0x80
0xC8 (200) 0xD4
0xC9 (201) 0x90
0xCA (202) 0xD2
0xCB (203) 0xD3
0xCC (204) 0xDE
0xCD (205) 0xD6
0xCE (206) 0xD7
0xCF (207) 0xD8
0xD0 (208) 0x00
0xD1 (209) 0xA5
0xD2 (210) 0xE3
0xD3 (211) 0xE0
0xD4 (212) 0xE2
0xD5 (213) 0xE5
0xD6 (214) 0x99
0xD7 (215) 0x9E
0xD8 (216) 0x9D
0xD9 (217) 0xEB
0xDA (218) 0xE9
0xDB (219) 0xEA
0xDC (220) 0x9A
0xDD (221) 0xED
0xDE (222) 0xE8
0xDF (223) 0xE1
0xE0 (224) 0x85, 0xA1
0xE1 (225) 0xA0
0xE2 (226) 0x83
0xE3 (227) 0xC6
0xE4 (228) 0x84
0xE5 (229) 0x86
0xE6 (230) 0x91
0xE7 (231) 0x87
0xE8 (232) 0x8A
0xE9 (233) 0x82
0xEA (234) 0x88
0xEB (235) 0x89
0xEC (236) 0x8D
0xED (237) 0x00
0xEE (238) 0x8C
0xEF (239) 0x8B
0xF0 (240) 0xD0
0xF1 (241) 0xA4
0xF2 (242) 0x95
0xF3 (243) 0xA2
0xF4 (244) 0x93
0xF5 (245) 0xE4
0xF6 (246) 0x94
0xF7 (247) 0xF6
0xF8 (248) 0x9B
0xF9 (249) 0x97
0xFA (250) 0xA3
0xFB (251) 0x96
0xFC (252) 0x81
0xFD (253) 0xEC
0xFE (254) 0xE7
0xFF (255) 0x98

View File

@@ -0,0 +1,162 @@
<html>
<head><title>The BFS filesystem structure</title></head>
<body>
<center><h1>The BFS filesystem structure</h1></center>
The UnixWare Boot FileSystem (BFS) is a filesystem used in SCO UnixWare.
It contains all files necessary for UnixWare boot procedures (such as
<tt>unix</tt>).
Because the object of the bfs filesystem type is to allow quick and
simple booting, BFS was designed as a contiguous flat filesystem. It
is not intended to support general users. The only directory bfs
supports is the root directory. Users can create only regular files;
no directories or special files can be created in the bfs filesystem.<p>
A BFS filesystem consists of three parts:
<ul>
<li> Superblock
<li> Inodes
<li> Data area
</ul>
Each block on disk is 512 bytes long; blocks are numbered from zero. Most
data structures use an "offset from the beginning of the disk"; divide such
an offset by 512 to get the block number.<p>
<b>NOTE:</b> Operations on a BFS filesystem in SCO UnixWare are severely limited.
For example, it is not possible to have two files open for writing
simultaneously. These restrictions do not
apply to operations involving only the reading of files.<p>
You can read a BFS filesystem from your Linux box. See
<A href="http://www.penguin.cz/~mhi/fs/bfs/">BFS Linux module home page</a>.
<p>
<h2>The BFS superblock</h2>
The superblock is at the beginning of the disk, block 0.
<table border=1>
<tr><th>Type
<th>Name
<th>Description
<tr><td>32bit int
<td>magic
<td>Magic number (0x1BADFACE)
<tr><td>32bit int
<td>start
<td>Start of data blocks (in bytes)
<tr><td>32bit int
<td>size
<td>Size of filesystem (in bytes)
<tr><td>4x 32bit int
<td>sanity words
<td>Sanity words are used to recover filesystem after interrupted
<A href="#compaction">compaction</a>. They are usually 0xFFFFFFFF.
</table>
<h2>BFS inodes</h2>
The inode contains all the information about a file except its name.
Filenames are kept in the root directory, the only directory in the
BFS filesystem. An inode is 64 bytes long. The inode table starts at
block number 1 and fills the space between the superblock and the first
data block (usually the root directory). The first inode has number 2.
<table border=1>
<tr><th>Type
<th>Name
<th>Description
<tr><td>32bit int
<td>inode number
<td>Inode number, often contains "garbage" in high 16 bits.
<tr><td>32bit int
<td>first block
<td>First block of file. Next block is n+1, n+2, ... n+x.
<tr><td>32bit int
<td>Last block
<td>Last block of file
<tr><td>32bit int
<td>offset to eof
<td>Disk offset to end of file (in bytes)
<tr><td>32bit int
<td>Attributes
<td>File attributes (1 = regular file, 2 = directory)
<tr><td>32bit int
<td>mode
<td>File mode, rwxrwxrwx (only low 9 bits used)
<tr><td>32bit int
<td>uid
<td>File owner - user id
<tr><td>32bit int
<td>gid
<td>File owner - group id
<tr><td>32bit int
<td>nlinks
<td>Hard link count
<tr><td>32bit int
<td>atime
<td>Access time
<tr><td>32bit int
<td>mtime
<td>Modify time
<tr><td>32bit int
<td>ctime
<td>Create time
<tr><td>4x 32bit int
<td>spare
<td>Unused, should be zero
</table>
The number of inodes is defined when mkfs is used to create the filesystem.
<p>
<h2>BFS storage blocks</h2>
The remainder of the space allocated to the filesystem is taken up by
data blocks. The storage blocks store the root directory and
the regular files. For a regular file, the storage blocks contain the
contents of the file. For the root directory, the storage blocks
contain 16-byte entries.
<table border=1>
<tr><th>Type
<th>Name
<th>Description
<tr><td>16bit int
<td>inode
<td>File inode number
<tr><td>14 8bit characters
<td>name
<td>File name
</table>
The root directory *MUST* begin with two entries "." and "..", both with
inode number 2 (root directory).
<p>
<h2>Managing BFS data blocks</h2>
The data or storage blocks for a file are allocated contiguously. The
data block after the last data block used in the filesystem is
considered the next data block available to store a file. When a file
is deleted, its data blocks are released.<p>
<A name="compaction"><h2>Compaction</h2></a>
Compaction is a way of recovering data blocks by shifting files until
the gaps left behind by deleted files are eliminated. This operation
can be expensive, but it is necessary because of the method used by
BFS to store and delete files.
You need to perform compaction when either of the following situations occurs:
<ul>
<li> The system has reached the end of the filesystem, and there are
still free blocks available.
<li> The system deletes a large file and the file after it on disk is
small and is the last file in the filesystem. (Small files are
files of no more than ten blocks; large files are files of 500 or
more blocks.)
</ul>
<h2>Related links</h2>
<A href="http://www.penguin.cz/~mhi/fs/bfs/">BFS Linux module</a><br>
<A href="http://www.sco.com/">SCO homepage</a><br>
<A href="http://www.penguin.cz/~mhi/fs/">Filesystems HOWTO</a><br>
<hr>
<center><i>Copyright (c) 1999 Martin Hinner,
<A href="mailto:mhi@penguin.cz">mhi@penguin.cz</a></i></center>
</body>
</html>


@@ -0,0 +1,901 @@
<!-- X-URL: http://www.mit.edu/~tytso/linux/ext2intro.html -->
<HTML>
<HEAD>
<TITLE>Design and Implementation of the Second Extended Filesystem</TITLE>
</HEAD>
<BODY>
<P>This paper was first published in the Proceedings of the First Dutch
International Symposium on Linux, ISBN 90-367-0385-9.</P>
<HR>
<H2>Design and Implementation of the Second Extended Filesystem</H2>
<H4>R&eacute;my Card, Laboratoire MASI--Institut Blaise Pascal,
E-Mail: card@masi.ibp.fr, and<BR>
Theodore Ts'o, Massachusetts Institute of Technology,
E-Mail: tytso@mit.edu, and<BR>
Stephen Tweedie, University of Edinburgh,
E-Mail: sct@dcs.ed.ac.uk</H4>
<H3>Introduction</H3>
<P>Linux is a Unix-like operating system, which runs on PC-386
computers. It was implemented first as extension to the Minix
operating system <A href="#minix">[Tanenbaum 1987]</A> and its
first versions included support for the Minix filesystem only.
The Minix filesystem has two serious limitations: block
addresses are stored in 16 bit integers, so the maximal
filesystem size is restricted to 64 megabytes; and directories
contain fixed-size entries, so the maximal file name length is 14
characters.
<P>We have designed and implemented two new filesystems that are
included in the standard Linux kernel. These filesystems,
called ``Extended File System'' (Ext fs) and ``Second Extended
File System'' (Ext2 fs) raise the limitations and add new
features.
<P>In this paper, we describe the history of Linux filesystems. We
briefly introduce the fundamental concepts implemented in Unix
filesystems. We present the implementation of the Virtual File
System layer in Linux and we detail the Second Extended File
System kernel code and user mode tools. Last, we present
performance measurements made on Linux and BSD filesystems and
we conclude with the current status of Ext2fs and the future
directions.
<H3>History of Linux filesystems</H3>
<P>In its very early days, Linux was cross-developed under the
Minix operating system. It was easier to share disks between
the two systems than to design a new filesystem, so Linus
Torvalds decided to implement support for the Minix filesystem
in Linux. The Minix filesystem was an efficient and relatively
bug-free piece of software.
<P>However, the restrictions in the design of the Minix
filesystem were too limiting, so people started thinking and
working on the implementation of new filesystems in Linux.
<P>In order to ease the addition of new filesystems into the
Linux kernel, a Virtual File System (VFS) layer was developed.
The VFS layer was initially written by Chris Provenzano, and
later rewritten by Linus Torvalds before it was integrated into
the Linux kernel. It is described in <A href="#section:vfs">The Virtual File System</A>.
<P>After the integration of the VFS in the kernel, a new
filesystem, called the ``Extended File System'' was implemented
in April 1992 and added to Linux 0.96c. This new filesystem
removed the two big Minix limitations: its maximal size was 2
giga bytes and the maximal file name size was 255 characters.
It was an improvement over the Minix filesystem but some
problems were still present in it. There was no support for the
separate access, inode modification, and data modification
timestamps. The filesystem used linked lists to keep track of
free blocks and inodes and this produced bad performance: as
the filesystem was used, the lists became unsorted and the
filesystem became fragmented.
<P>As a response to these problems, two new filesystems were
released in Alpha version in January 1993: the Xia filesystem
and the Second Extended File System. The Xia filesystem was
heavily based on the Minix filesystem kernel code and only
added a few improvements over this filesystem. Basically, it
provided long file names, support for bigger partitions and
support for the three timestamps. On the other hand, Ext2fs was
based on the Extfs code with many reorganizations and many
improvements. It had been designed with evolution in mind and
contained space for future improvements. It is described in
more detail in <A href="#section:ext2fs">The Second
Extended File System</A>.
<P>When the two new filesystems were first released, they
provided essentially the same features. Due to its minimal
design, Xia fs was more stable than Ext2fs. As the filesystems
were used more widely, bugs were fixed in Ext2fs and lots of
improvements and new features were integrated. Ext2fs is now
very stable and has become the de-facto standard Linux
filesystem.
<P>This table contains a summary of the features
provided by the different filesystems:
<TABLE border>
<TR><TH></TH><TH>Minix FS</TH><TH>Ext FS</TH><TH>Ext2 FS</TH><TH>Xia FS</TH></TR>
<TR><TH>Max FS size</TH><TD>64 MB</TD><TD>2 GB</TD><TD>4 TB</TD><TD>2 GB</TD></TR>
<TR><TH>Max file size</TH><TD>64 MB</TD><TD>2 GB</TD><TD>2 GB</TD><TD>64 MB</TD></TR>
<TR><TH>Max file name</TH><TD>16/30 c</TD><TD>255 c</TD><TD>255 c</TD><TD>248 c</TD></TR>
<TR><TH>3 times support</TH><TD>No</TD><TD>No</TD><TD>Yes</TD><TD>Yes</TD></TR>
<TR><TH>Extensible</TH><TD>No</TD><TD>No</TD><TD>Yes</TD><TD>No</TD></TR>
<TR><TH>Var. block size</TH><TD>No</TD><TD>No</TD><TD>Yes</TD><TD>No</TD></TR>
<TR><TH>Maintained</TH><TD>Yes</TD><TD>No</TD><TD>Yes</TD><TD>?</TD></TR>
</TABLE>
<H3>Basic File System Concepts</H3>
<P>Every Linux filesystem implements a basic set of common
concepts derived from the Unix operating system
<A href="#bach">[Bach 1986]</A>: files are represented by inodes,
directories are simply files containing a list of entries, and
devices can be accessed by requesting I/O on special files.
<H4>Inodes</H4>
<P>Each file is represented by a structure, called an inode.
Each inode contains the description of the file: file type,
access rights, owners, timestamps, size, pointers to data
blocks. The addresses of data blocks allocated to a file are
stored in its inode. When a user requests an I/O operation on
the file, the kernel code converts the current offset to a
block number, uses this number as an index in the block
addresses table and reads or writes the physical block. This
figure represents the structure of an inode:
<IMG SRC="ext2-inode.gif">
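<P>The offset-to-block translation described above can be sketched as
follows. This is a minimal model, not kernel code: the inode is reduced
to a plain block-address table, the block size is assumed, and indirect
blocks are ignored.

```python
# Sketch of the offset-to-block conversion described in the text.
# All structures are hypothetical simplifications of the kernel's.

BLOCK_SIZE = 1024  # assumed logical block size

class Inode:
    def __init__(self, block_addresses):
        # Table of physical block numbers, indexed by the logical
        # block number within the file.
        self.block_addresses = block_addresses

def offset_to_physical_block(inode, offset):
    """Convert a byte offset in a file to a physical block number."""
    logical = offset // BLOCK_SIZE           # index into the address table
    return inode.block_addresses[logical]    # read/write this physical block

inode = Inode([201, 202, 57, 58])            # a file with four data blocks
assert offset_to_physical_block(inode, 0) == 201
assert offset_to_physical_block(inode, 2500) == 57   # 2500 // 1024 == 2
```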
<H4>Directories</H4>
<P>Directories are structured in a hierarchical tree. Each
directory can contain files and subdirectories.
<P>Directories are implemented as a special type of files.
Actually, a directory is a file containing a list of entries.
Each entry contains an inode number and a file name. When a
process uses a pathname, the kernel code searches the
directories to find the corresponding inode number. After the
name has been converted to an inode number, the inode is loaded
into memory and is used by subsequent requests.
<P>This figure represents a directory:
<IMG SRC="ext2-dir.gif">
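<P>The pathname-to-inode walk described above can be sketched as
follows. Directories are modeled as in-memory maps from names to inode
numbers; the layout and the root inode number 2 are illustrative
assumptions (2 happens to be the conventional root inode in ext2).

```python
# Hypothetical on-disk state: inode number -> directory contents.
directories = {
    2:  {"usr": 11, "etc": 12},   # inode 2 is the root directory
    11: {"bin": 20},
    20: {"ls": 31},
}

ROOT_INODE = 2

def namei(path):
    """Resolve an absolute pathname to an inode number, one
    component at a time, as the kernel's lookup does."""
    inode = ROOT_INODE
    for component in path.strip("/").split("/"):
        if component:                          # skip empty components
            inode = directories[inode][component]
    return inode

assert namei("/usr/bin/ls") == 31
assert namei("/etc") == 12
```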
<H4>Links</H4>
<P>Unix filesystems implement the concept of a link. Several
names can be associated with an inode. The inode contains a
field holding the number of links associated with the file.
Adding a link simply consists of creating a directory entry,
whose inode number points to the inode, and incrementing the
link count in the inode. When a link is deleted, i.e. when one
uses the <TT>rm</TT> command to remove a filename, the kernel
decrements the link count and deallocates the inode if this
count becomes zero.
<P>This type of link is called a hard link and can only be used
within a single filesystem: it is impossible to create
cross-filesystem hard links. Moreover, hard links can only
point to files: hard links to directories cannot be created,
in order to prevent cycles in the directory tree.
<P>Another kind of link exists in most Unix filesystems.
Symbolic links are simply files which contain a filename. When
the kernel encounters a symbolic link during a pathname to
inode conversion, it replaces the name of the link by its
contents, i.e. the name of the target file, and restarts the
pathname interpretation. Since a symbolic link does not point
to an inode, it is possible to create cross-filesystem
symbolic links. Symbolic links can point to any type of file,
even to nonexistent files. Symbolic links are very useful
because they don't have the limitations associated with hard
links. However, they use some disk space, allocated for their
inode and their data blocks, and cause an overhead in the
pathname to inode conversion because the kernel has to restart
the name interpretation when it encounters a symbolic link.
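<P>The hard-link bookkeeping described above can be sketched as
follows: a link is nothing more than a directory entry plus a link
count in the inode. The structures (a single flat directory, a
count-only inode table) are hypothetical simplifications.

```python
# Sketch of hard-link semantics: directory entries plus link counts.
inodes = {}        # inode number -> link count
directory = {}     # name -> inode number (one flat directory for brevity)

def create(name, ino):
    directory[name] = ino
    inodes[ino] = 1

def link(existing, new_name):
    """Add a hard link: a new entry to the same inode, count += 1."""
    ino = directory[existing]
    directory[new_name] = ino
    inodes[ino] += 1

def unlink(name):
    """Remove a name; deallocate the inode when the count hits zero."""
    ino = directory.pop(name)
    inodes[ino] -= 1
    if inodes[ino] == 0:
        del inodes[ino]           # last name gone: free the inode

create("file1", 42)
link("file1", "alias")
unlink("file1")
assert 42 in inodes               # still reachable through "alias"
unlink("alias")
assert 42 not in inodes           # link count reached zero, inode freed
```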
<H4>Device special files</H4>
<P>In Unix-like operating systems, devices can be accessed via
special files. A device special file does not use any space on
the filesystem. It is only an access point to the device
driver.
<P>Two types of special files exist: character and block
special files. The former allows I/O operations in character
mode while the latter requires data to be written in block mode
via the buffer cache functions. When an I/O request is made on
a special file, it is forwarded to a (pseudo) device driver. A
special file is referenced by a major number, which identifies
the device type, and a minor number, which identifies the unit.
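<P>The (major, minor) pair described above can be inspected from user
space; the following small sketch uses Python's standard <TT>os</TT>
module. The pair (8, 1) is just an illustrative choice (on Linux,
major 8 is conventionally a SCSI disk).

```python
import os

# A device special file is identified by a (major, minor) pair packed
# into a single device number. os.makedev packs the pair and
# os.major/os.minor unpack it again.
dev = os.makedev(8, 1)
assert os.major(dev) == 8    # device type (driver)
assert os.minor(dev) == 1    # unit within that driver
```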
<A name="section:vfs">
<H3>The Virtual File System</H3>
</A>
<H4>Principle</H4>
<P>The Linux kernel contains a Virtual File System layer which
is used during system calls acting on files. The VFS is an
indirection layer which handles the file oriented system calls
and calls the necessary functions in the physical filesystem
code to do the I/O.
<P>This indirection mechanism is frequently used in Unix-like
operating systems to ease the integration and the use of
several filesystem types <A href="#vnodes">[Kleiman 1986,</A>
<A href="#lfs:unix">Seltzer <I>et al.</I> 1993]</A>.
<P>When a process issues a file oriented system call, the
kernel calls a function contained in the VFS. This function
handles the structure independent manipulations and redirects
the call to a function contained in the physical filesystem
code, which is responsible for handling the structure dependent
operations. Filesystem code uses the buffer cache functions to
request I/O on devices. This scheme is illustrated in this
figure:
<IMG SRC="ext2-vfs.gif">
<H4>The VFS structure</H4>
<P>The VFS defines a set of functions that every filesystem has
to implement. This interface is made up of a set of operations
associated to three kinds of objects: filesystems, inodes, and
open files.
<P>The VFS knows about filesystem types supported in the
kernel. It uses a table defined during the kernel
configuration. Each entry in this table describes a filesystem
type: it contains the name of the filesystem type and a pointer
on a function called during the mount operation. When a
filesystem is to be mounted, the appropriate mount function is
called. This function is responsible for reading the superblock
from the disk, initializing its internal variables, and
returning a mounted filesystem descriptor to the VFS. After the
filesystem is mounted, the VFS functions can use this
descriptor to access the physical filesystem routines.
<P>A mounted filesystem descriptor contains several kinds of
data: information that is common to every filesystem type,
pointers to functions provided by the physical filesystem
kernel code, and private data maintained by the physical
filesystem code. The function pointers contained in the
filesystem descriptors allow the VFS to access the filesystem
internal routines.
<P>Two other types of descriptors are used by the VFS: an inode descriptor
and an open file descriptor. Each descriptor contains information related to
files in use and a set of operations provided by the physical filesystem code.
While the inode descriptor contains pointers to functions that can be used to
act on any file (e.g. <TT>create</TT>, <TT>unlink</TT>), the file descriptor
contains pointers to functions which can only act on open files (e.g.
<TT>read</TT>, <TT>write</TT>).
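<P>The indirection scheme above can be sketched as follows: a table of
filesystem types, a mount function per type that returns a descriptor,
and VFS entry points that redirect through the descriptor's operation
pointers. Every name here (<TT>vfs_mount</TT>, <TT>ext2_mount</TT>, the
descriptor layout) is a hypothetical stand-in for the kernel's C
structures.

```python
# Minimal model of the VFS dispatch described in the text.
filesystem_types = {}   # type name -> mount function (the config table)

def register_filesystem(name, mount_fn):
    filesystem_types[name] = mount_fn

def vfs_mount(fstype, device):
    # The physical code would read the superblock here and return a
    # mounted-filesystem descriptor holding its operation pointers.
    return filesystem_types[fstype](device)

def vfs_read(mounted_fs, inode, offset, size):
    # Structure-independent part, then redirect to the physical code.
    return mounted_fs["ops"]["read"](inode, offset, size)

def ext2_mount(device):
    def ext2_read(inode, offset, size):
        return f"ext2 read {size}@{offset} from inode {inode} on {device}"
    return {"device": device, "ops": {"read": ext2_read}}

register_filesystem("ext2", ext2_mount)
fs = vfs_mount("ext2", "/dev/hda1")
assert vfs_read(fs, 12, 0, 1024) == "ext2 read 1024@0 from inode 12 on /dev/hda1"
```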
<A name="section:ext2fs">
<H3>The Second Extended File System</H3>
</A>
<H4>Motivations</H4>
<P>The Second Extended File System has been designed and
implemented to fix some problems present in the first Extended
File System. Our goal was to provide a powerful filesystem,
which implements Unix file semantics and offers advanced
features.
<P>Of course, we wanted Ext2fs to have excellent
performance. We also wanted to provide a very robust
filesystem in order to reduce the risk of data loss in
intensive use. Last, but not least, Ext2fs had to include
provision for extensions to allow users to benefit from new
features without reformatting their filesystem.
<H4>``Standard'' Ext2fs features</H4>
<P>The Ext2fs supports standard Unix file types: regular files,
directories, device special files and symbolic links.
<P>Ext2fs is able to manage filesystems created on very big
partitions. While the original kernel code restricted the
maximal filesystem size to 2 GB, recent work in the VFS layer
has raised this limit to 4 TB. Thus, it is now possible to use
big disks without needing to create many partitions.
<P>Ext2fs provides long file names. It uses variable length
directory entries. The maximal file name size is 255
characters. This limit could be extended to 1012 if needed.
<P>Ext2fs reserves some blocks for the super user
(<TT>root</TT>). Normally, 5% of the blocks are reserved. This
allows the administrator to recover easily from situations
where user processes fill up filesystems.
<A name="subsection:ext2fs:adv-feat">
<H4>``Advanced'' Ext2fs features</H4>
</A>
<P>In addition to the standard Unix features, Ext2fs supports
some extensions which are not usually present in Unix
filesystems.
<P>File attributes allow the users to modify the kernel
behavior when acting on a set of files. One can set attributes
on a file or on a directory. In the latter case, new files
created in the directory inherit these attributes.
<P>BSD or System V Release 4 semantics can be selected at mount
time. A mount option allows the administrator to choose the
file creation semantics. On a filesystem mounted with BSD
semantics, files are created with the same group id as their
parent directory. System V semantics are a bit more complex: if
a directory has the setgid bit set, new files inherit the group
id of the directory and subdirectories inherit the group id and
the setgid bit; in the other case, files and subdirectories are
created with the primary group id of the calling process.
<P>BSD-like synchronous updates can be used in Ext2fs. A mount
option allows the administrator to request that metadata
(inodes, bitmap blocks, indirect blocks and directory blocks)
be written synchronously on the disk when they are modified.
This can be useful to maintain strict metadata consistency,
but it leads to poor performance. In practice, this feature is
not normally used, since in addition to the performance loss
associated with using synchronous updates of the metadata, it
can cause corruption in the user data which will not be flagged
by the filesystem checker.
<P>Ext2fs allows the administrator to choose the logical block
size when creating the filesystem. Block sizes can typically be
1024, 2048 and 4096 bytes. Using big block sizes can speed up
I/O since fewer I/O requests, and thus fewer disk head seeks,
need to be done to access a file. On the other hand, big blocks
waste more disk space: on the average, the last block allocated
to a file is only half full, so as blocks get bigger, more
space is wasted in the last block of each file. In addition,
most of the advantages of larger block sizes are obtained by
Ext2 filesystem's preallocation techniques (see section
<A href="#subsection:ext2fs:allocation">Performance optimizations</A>).
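<P>The space trade-off above is easy to quantify: with the last block of
each file only half full on average, the expected waste grows linearly
with the block size. A back-of-the-envelope sketch:

```python
# Expected internal fragmentation under the "last block half full"
# assumption stated in the text.

def expected_waste(block_size, n_files):
    """Bytes wasted, on average, in the last block of each file."""
    return n_files * block_size // 2

# The same 1000 files waste four times as much with 4 KB blocks
# as with 1 KB blocks.
assert expected_waste(1024, 1000) == 512_000      # ~0.5 MB
assert expected_waste(4096, 1000) == 2_048_000    # ~2 MB
```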
<P>Ext2fs implements fast symbolic links. A fast symbolic link
does not use any data block on the filesystem. The target name
is not stored in a data block but in the inode itself. This
policy can save some disk space (no data block needs to be
allocated) and speeds up link operations (there is no need to
read a data block when accessing such a link). Of course, the
space available in the inode is limited so not every link can
be implemented as a fast symbolic link. The maximal size of the
target name in a fast symbolic link is 60 characters. We plan
to extend this scheme to small files in the near future.
<P>Ext2fs keeps track of the filesystem state. A special field
in the superblock is used by the kernel code to indicate the
status of the file system. When a filesystem is mounted in
read/write mode, its state is set to ``Not Clean''. When it is
unmounted or remounted in read-only mode, its state is reset to
``Clean''. At boot time, the filesystem checker uses this
information to decide if a filesystem must be checked. The
kernel code also records errors in this field. When an
inconsistency is detected by the kernel code, the filesystem is
marked as ``Erroneous''. The filesystem checker tests this to
force the check of the filesystem regardless of its apparently
clean state.
<P>Always skipping filesystem checks may sometimes be
dangerous, so Ext2fs provides two ways to force checks at
regular intervals. A mount counter is maintained in the
superblock. Each time the filesystem is mounted in read/write
mode, this counter is incremented. When it reaches a maximal
value (also recorded in the superblock), the filesystem checker
forces the check even if the filesystem is ``Clean''. A last
check time and a maximal check interval are also maintained in
the superblock. These two fields allow the administrator to
request periodical checks. When the maximal check interval has
been reached, the checker ignores the filesystem state and
forces a filesystem check.
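<P>The checker's decision logic described in the last two paragraphs can
be sketched as follows. The field names are hypothetical; the real
counterparts live in the ext2 superblock.

```python
# Sketch of when the filesystem checker forces a full check.

def must_check(state, mount_count, max_mount_count,
               last_check, check_interval, now):
    if state != "Clean":                    # "Not Clean" or "Erroneous"
        return True
    if mount_count >= max_mount_count:      # too many mounts since a check
        return True
    if now - last_check >= check_interval:  # periodical check is due
        return True
    return False

assert must_check("Erroneous", 0, 20, 0, 180, 10)       # error recorded
assert must_check("Clean", 20, 20, 0, 180, 10)          # mount counter hit
assert must_check("Clean", 1, 20, 0, 180, 200)          # interval elapsed
assert not must_check("Clean", 1, 20, 0, 180, 10)       # nothing to do
```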
<P>Ext2fs offers tools to tune the filesystem behavior.
The <TT>tune2fs</TT> program can be used to modify:
<UL>
<LI>the error behavior. When an inconsistency is detected by
the kernel code, the filesystem is marked as ``Erroneous'' and
one of the three following actions can be done: continue normal
execution, remount the filesystem in read-only mode to avoid
corrupting the filesystem, make the kernel panic and reboot to
run the filesystem checker.
<LI>the maximal mount count.
<LI>the maximal check interval.
<LI>the number of logical blocks reserved for the super user.
</UL>
<P>Mount options can also be used to change the kernel error behavior.
<P>An attribute allows the users to request secure deletion on
files. When such a file is deleted, random data is written in
the disk blocks previously allocated to the file. This prevents
malicious people from gaining access to the previous content of
the file by using a disk editor.
<P>Last, new types of files inspired by the 4.4 BSD
filesystem have recently been added to Ext2fs. Immutable files
can only be read: nobody can write or delete them. This can be
used to protect sensitive configuration files. Append-only
files can be opened in write mode but data is always appended
at the end of the file. Like immutable files, they cannot be
deleted or renamed. This is especially useful for log files
which can only grow.
<H4>Physical Structure</H4>
<P>The physical structure of Ext2 filesystems has been strongly
influenced by the layout of the BSD filesystem
<A href="#mckusick:ffs">[McKusick <I>et al.</I> 1984]</A>. A
filesystem is made up of block groups. Block groups are
analogous to BSD FFS's cylinder groups. However, block groups
are not tied to the physical layout of the blocks on the disk,
since modern drives tend to be optimized for sequential access
and hide their physical geometry from the operating system.
<P>The physical structure of a filesystem is represented in this
table:
<TABLE border>
<TR>
<TD>Boot<BR>Sector</TD>
<TD>Block<BR>Group 1</TD>
<TD>Block<BR>Group 2</TD>
<TD>...<BR>...</TD>
<TD>Block<BR>Group N</TD>
</TR>
</TABLE>
<P>Each block group contains a redundant copy of crucial filesystem
control information (the superblock and the filesystem descriptors) and
also contains a part of the filesystem (a block bitmap, an inode
bitmap, a piece of the inode table, and data blocks). The structure of
a block group is represented in this table:
<TABLE border>
<TR>
<TD>Super<BR>Block</TD>
<TD>FS<BR>descriptors</TD>
<TD>Block<BR>Bitmap</TD>
<TD>Inode<BR>Bitmap</TD>
<TD>Inode<BR>Table</TD>
<TD>Data<BR>Blocks</TD>
</TR>
</TABLE>
<P>Using block groups is a big win in terms of reliability:
since the control structures are replicated in each block
group, it is easy to recover from a filesystem where the
superblock has been corrupted. This structure also helps to get
good performance: by reducing the distance between the inode
table and the data blocks, it is possible to reduce the disk
head seeks during I/O on files.
<P>In Ext2fs, directories are managed as linked lists of
variable length entries. Each entry contains the inode number,
the entry length, the file name and its length. By using
variable length entries, it is possible to implement long file
names without wasting disk space in directories. The structure
of a directory entry is shown in this table:
<TABLE border>
<TR>
<TD>inode number</TD><TD>entry length</TD>
<TD>name length</TD><TD>filename</TD>
</TR>
</TABLE>
<P>As an example, the next table represents the structure of a
directory containing three files: <TT>file1</TT>,
<TT>long_file_name</TT>, and <TT>f2</TT>:
<TABLE border>
<TR><TD>i1</TD><TD>16</TD><TD>05</TD><TD><TT>file1 </TT></TD></TR>
</TABLE>
<TABLE border>
<TR><TD>i2</TD><TD>40</TD><TD>14</TD><TD><TT>long_file_name </TT></TD></TR>
</TABLE>
<TABLE border>
<TR><TD>i3</TD><TD>12</TD><TD>02</TD><TD><TT>f2 </TT></TD></TR>
</TABLE>
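<P>The minimal size of a directory entry in this layout can be computed
as an 8-byte header (inode number, entry length, name length) plus the
name, padded to a 4-byte boundary. Note that a stored entry length may
be larger than this minimum, for example to absorb the space left by a
deleted neighbour, which is one plausible reading of why
<TT>long_file_name</TT> occupies 40 bytes above rather than 24. A
sketch of the minimum, under that assumed header size:

```python
# Minimal directory entry length: 8-byte header + name padded to 4 bytes.
# The header size and padding rule are assumptions matching the layout
# sketched in the text.

def min_entry_len(name):
    header = 8
    return header + (len(name) + 3) // 4 * 4   # round name up to 4 bytes

assert min_entry_len("file1") == 16   # matches the first entry above
assert min_entry_len("f2") == 12      # matches the last entry above
```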
<A name="subsection:ext2fs:allocation">
<H4>Performance optimizations</H4>
</A>
<P>The Ext2fs kernel code contains many performance
optimizations, which tend to improve I/O speed when reading and
writing files.
<P>Ext2fs takes advantage of the buffer cache management by
performing readaheads: when a block has to be read, the kernel
code requests the I/O on several contiguous blocks. This way,
it tries to ensure that the next block to read will already be
loaded into the buffer cache. Readaheads are normally performed
during sequential reads on files and Ext2fs extends them to
directory reads, either explicit reads (<TT>readdir(2)</TT>
calls) or implicit ones (<TT>namei</TT> kernel directory
lookup).
<P>Ext2fs also contains many allocation optimizations. Block
groups are used to cluster together related inodes and data:
the kernel code always tries to allocate data blocks for a file
in the same group as its inode. This is intended to reduce the
disk head seeks made when the kernel reads an inode and its
data blocks.
<P>When writing data to a file, Ext2fs preallocates up to 8
adjacent blocks when allocating a new block. Preallocation hit
rates are around 75% even on very full filesystems. This
preallocation achieves good write performance under heavy
load. It also allows contiguous blocks to be allocated to
files, thus speeding up future sequential reads.
<P>These two allocation optimizations produce very good locality of:
<UL>
<LI>related files through block groups
<LI>related blocks through the 8 bits clustering of block allocations.
</UL>
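<P>The preallocation idea above can be sketched as follows: when a file
needs a new block, the allocator grabs up to 8 adjacent free blocks at
once so that later writes to the same file stay contiguous. Free-block
tracking is simplified to a set; the real code uses the block bitmaps.

```python
# Sketch of 8-block preallocation on a simplified free-block set.
PREALLOC = 8

free_blocks = set(range(100, 200))   # hypothetical free blocks

def allocate_with_prealloc(start_hint):
    """Return (block, preallocated): the block handed to the caller
    plus up to 7 adjacent blocks reserved for future writes."""
    block = min(b for b in free_blocks if b >= start_hint)
    run = []
    b = block
    while b in free_blocks and len(run) < PREALLOC:
        free_blocks.discard(b)       # claim an adjacent run of blocks
        run.append(b)
        b += 1
    return run[0], run[1:]

first, reserved = allocate_with_prealloc(100)
assert first == 100
assert reserved == [101, 102, 103, 104, 105, 106, 107]
```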
<H3>The Ext2fs library</H3>
<P>To allow user mode programs to manipulate the control
structures of an Ext2 filesystem, the libext2fs library was
developed. This library provides routines which can be used to
examine and modify the data of an Ext2 filesystem, by accessing
the filesystem directly through the physical device.
<P>The Ext2fs library was designed to allow maximal code reuse
through the use of software abstraction techniques. For
example, several different iterators are provided. A program
can simply pass in a function to
<TT>ext2fs_block_iterate()</TT>, which will be called for each
block in an inode. Another iterator function allows a
user-provided function to be called for each file in a
directory.
<P>Many of the Ext2fs utilities (<TT>mke2fs</TT>,
<TT>e2fsck</TT>, <TT>tune2fs</TT>, <TT>dumpe2fs</TT>, and
<TT>debugfs</TT>) use the Ext2fs library. This greatly
simplifies the maintenance of these utilities, since any
changes to reflect new features in the Ext2 filesystem format
need only be made in one place--in the Ext2fs library. This
code reuse also results in smaller binaries, since the Ext2fs
library can be built as a shared library image.
<P>Because the interfaces of the Ext2fs library are so abstract
and general, new programs which require direct access to the
Ext2fs filesystem can very easily be written. For example, the
Ext2fs library was used during the port of the 4.4BSD dump and
restore backup utilities. Very few changes were needed to adapt
these tools to Linux: only a few filesystem dependent functions
had to be replaced by calls to the Ext2fs library.
<P>The Ext2fs library provides access to several classes of
operations. The first class are the filesystem-oriented
operations. A program can open and close a filesystem, read
and write the bitmaps, and create a new filesystem on the disk.
Functions are also available to manipulate the filesystem's bad
blocks list.
<P>The second class of operations affect directories. A caller
of the Ext2fs library can create and expand directories, as
well as add and remove directory entries. Functions are also
provided to both resolve a pathname to an inode number, and to
determine a pathname of an inode given its inode number.
<P>The final class of operations are oriented around inodes.
It is possible to scan the inode table, read and write inodes,
and scan through all of the blocks in an inode. Allocation and
deallocation routines are also available and allow user mode
programs to allocate and free blocks and inodes.
<H3>The Ext2fs tools</H3>
<P>Powerful management tools have been developed for Ext2fs.
These utilities are used to create, modify, and correct any
inconsistencies in Ext2 filesystems. The <TT>mke2fs</TT>
program is used to initialize a partition to contain an empty
Ext2 filesystem.
<P>The <TT>tune2fs</TT> program can be used to modify the filesystem
parameters. As explained in section <A href="#subsection:ext2fs:adv-feat">
``Advanced'' Ext2fs features</A>, it can change the error
behavior, the maximal mount count, the maximal check interval,
and the number of logical blocks reserved for the super user.
<P>The most interesting tool is probably the filesystem
checker. <TT>E2fsck</TT> is intended to repair filesystem
inconsistencies after an unclean shutdown of the system. The
original version of <TT>e2fsck</TT> was based on Linus
Torvalds's fsck program for the Minix filesystem. However, the
current version of <TT>e2fsck</TT> was rewritten from scratch,
using the Ext2fs library, and is much faster and can correct
more filesystem inconsistencies than the original version.
<P>The <TT>e2fsck</TT> program is designed to run as quickly as
possible. Since filesystem checkers tend to be disk bound,
this was done by optimizing the algorithms used by
<TT>e2fsck</TT> so that filesystem structures are not
repeatedly accessed from the disk. In addition, inodes and
directories are checked in order of increasing block number to
reduce the time spent in disk seeks. Many of
these ideas were originally explored by
<A href="#bsd:fsck">[Bina and Emrath 1989]</A> although they have
since been further refined by the authors.
<P>In pass 1, <TT>e2fsck</TT> iterates over all of the inodes
in the filesystem and performs checks over each inode as an
unconnected object in the filesystem. That is, these checks do
not require any cross-checks to other filesystem objects.
Examples of such checks include making sure the file mode is
legal, and that all of the blocks in the inode are valid block
numbers. During pass 1, bitmaps indicating which blocks and
inodes are in use are compiled.
<P>If <TT>e2fsck</TT> notices data blocks which are claimed by
more than one inode, it invokes passes 1B through 1D to resolve
these conflicts, either by cloning the shared blocks so that
each inode has its own copy of the shared block, or by
deallocating one or more of the inodes.
<P>Pass 1 takes the longest time to execute, since all of the
inodes have to be read into memory and checked. To reduce the
I/O time necessary in future passes, critical filesystem
information is cached in memory. The most important example of
this technique is the location on disk of all of the directory
blocks on the filesystem. This obviates the need to re-read
the directory inodes structures during pass 2 to obtain this
information.
<P>Pass 2 checks directories as unconnected objects. Since
directory entries do not span disk blocks, each directory block
can be checked individually without reference to other
directory blocks. This allows <TT>e2fsck</TT> to sort all of
the directory blocks by block number, and check directory
blocks in ascending order, thus decreasing disk seek time. The
directory blocks are checked to make sure that the directory
entries are valid, and contain references to inode numbers
which are in use (as determined by pass 1).
<P>For the first directory block in each directory inode, the
`.' and `..' entries are checked to make sure they exist, and
that the inode number for the `.' entry matches the current
directory. (The inode number for the `..' entry is not checked
until pass 3.)
<P>Pass 2 also caches information concerning the parent
directory in which each directory is linked. (If a directory
is referenced by more than one directory, the second reference
of the directory is treated as an illegal hard link, and it is
removed).
<P>It is worth noting that at the end of pass 2, nearly
all of the disk I/O which <TT>e2fsck</TT> needs to perform is
complete. Information required by passes 3, 4 and 5 are cached
in memory; hence, the remaining passes of <TT>e2fsck</TT> are
largely CPU bound, and take less than 5-10% of the total
running time of <TT>e2fsck</TT>.
<P>In pass 3, the directory connectivity is checked.
<TT>E2fsck</TT> traces the path of each directory back to the
root, using information that was cached during pass 2. At this
time, the `..' entry for each directory is also checked to make
sure it is valid. Any directories which can not be traced back
to the root are linked to the <TT>/lost+found</TT> directory.
<P>In pass 4, <TT>e2fsck</TT> checks the reference counts for
all inodes, by iterating over all the inodes and comparing the
link counts (which were cached in pass 1) against internal
counters computed during passes 2 and 3. Any undeleted files
with a zero link count are also linked to the
<TT>/lost+found</TT> directory during this pass.
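<P>The pass-4 reconciliation can be sketched as follows. This is a
simplified model, not <TT>e2fsck</TT>'s actual code: stored link counts
and counted references are plain dictionaries, and an in-use inode with
no remaining directory references is treated as an orphan to reconnect
under <TT>/lost+found</TT>.

```python
# Sketch of pass 4: compare stored link counts (cached in pass 1)
# with directory references actually counted in passes 2 and 3.

def pass4(stored_counts, counted_refs):
    """Return (counts to fix, orphan inodes to reconnect)."""
    fix, orphans = {}, []
    for ino, stored in stored_counts.items():
        actual = counted_refs.get(ino, 0)
        if actual == 0 and stored != 0:
            orphans.append(ino)      # undeleted file with no name left
        elif actual != stored:
            fix[ino] = actual        # correct the on-disk link count
    return fix, orphans

fix, orphans = pass4({11: 2, 12: 1, 13: 1}, {11: 2, 12: 3})
assert fix == {12: 3}                # inode 12's count was stale
assert orphans == [13]               # inode 13 goes to /lost+found
```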
<P>Finally, in pass 5, <TT>e2fsck</TT> checks the validity of
the filesystem summary information. It compares the block and
inode bitmaps which were constructed during the previous passes
against the actual bitmaps on the filesystem, and corrects the
on-disk copies if necessary.
<P>The filesystem debugger is another useful tool.
<TT>Debugfs</TT> is a powerful program which can be used to
examine and change the state of a filesystem. Basically, it
provides an interactive interface to the Ext2fs library:
commands typed by the user are translated into calls to the
library routines.
<P><TT>Debugfs</TT> can be used to examine the internal
structures of a filesystem, manually repair a corrupted
filesystem, or create test cases for <TT>e2fsck</TT>.
Unfortunately, this program can be dangerous if it is used by
people who do not know what they are doing; it is very easy to
destroy a filesystem with this tool. For this reason,
<TT>debugfs</TT> opens filesystems for read-only access by
default. The user must explicitly specify the <TT>-w</TT> flag
in order to use <TT>debugfs</TT> to open a filesystem for
read/write access.
<H3>Performance Measurements</H3>
<H4>Description of the benchmarks</H4>
<P>We have run benchmarks to measure filesystem performance.
Benchmarks have been made on a mid-range PC, based on an
i486DX2 processor, using 16 MB of memory and two 420 MB IDE
disks. The tests were run on Ext2 fs and Xia fs (Linux 1.1.62)
and on the BSD Fast filesystem in asynchronous and synchronous
mode (FreeBSD 2.0 Alpha--based on the 4.4BSD Lite
distribution).
<P>We have run two different benchmarks. The Bonnie benchmark
tests I/O speed on a big file--the file size was set to 60 MB
during the tests. It writes data to the file using character
based I/O, rewrites the contents of the whole file, writes data
using block based I/O, reads the file using character I/O and
block I/O, and seeks into the file. The Andrew Benchmark was
developed at Carnegie Mellon University and has been used at
the University of California at Berkeley to benchmark BSD FFS
and LFS. It runs in five phases: it creates a directory
hierarchy, makes a copy of the data, recursively examines the
status of every file, examines every byte of every file, and
compiles several of the files.
<H4>Results of the Bonnie benchmark</H4>
<P>The results of the Bonnie benchmark are presented in this
table:
<TABLE border>
<TR><TH></TH><TH>Char Write<BR>(KB/s)</TH>
<TH>Block Write<BR>(KB/s)</TH>
<TH>Rewrite<BR>(KB/s)</TH>
<TH>Char Read<BR>(KB/s)</TH>
<TH>Block Read<BR>(KB/s)</TH></TR>
<TR><TD>BSD Async</TD><TD align="right">710</TD><TD align="right">684</TD><TD align="right">401</TD><TD align="right">721</TD><TD align="right">888</TD></TR>
<TR><TD>BSD Sync</TD><TD align="right">699</TD><TD align="right">677</TD><TD align="right">400</TD><TD align="right">710</TD><TD align="right">878</TD></TR>
<TR><TD>Ext2 fs</TD><TD align="right">452</TD><TD align="right">1237</TD><TD align="right">536</TD><TD align="right">397</TD><TD align="right">1033</TD></TR>
<TR><TD>Xia fs</TD><TD align="right">440</TD><TD align="right">704</TD><TD align="right">380</TD><TD align="right">366</TD><TD align="right">895</TD></TR>
</TABLE>
<P>The results are very good in block oriented I/O: Ext2 fs
outperforms other filesystems. This is clearly a benefit of the
optimizations included in the allocation routines. Writes are
fast because data is written in cluster mode. Reads are fast
because contiguous blocks have been allocated to the file. Thus
there is no head seek between two reads and the readahead
optimizations can be fully used.
<P>On the other hand, character oriented I/O performs better
under FreeBSD. This is probably due to the fact that FreeBSD
and Linux do not use the same stdio routines in their
respective C libraries: FreeBSD seems to have a more optimized
character I/O library.
<H4>Results of the Andrew benchmark</H4>
The results of the Andrew benchmark are presented in
this table:
<TABLE border>
<TR>
<TH></TH>
<TH>P1 Create<BR>(ms)</TH>
<TH>P2 Copy<BR>(ms)</TH>
<TH>P3 Stat<BR>(ms)</TH>
<TH>P4 Grep<BR>(ms)</TH>
<TH>P5 Compile<BR>(ms)</TH>
</TR>
<TR><TD>BSD Async</TD><TD align="right">2203</TD><TD align="right">7391</TD><TD align="right">6319</TD><TD align="right">17466</TD><TD align="right">75314</TD></TR>
<TR><TD>BSD Sync</TD><TD align="right">2330</TD><TD align="right">7732</TD><TD align="right">6317</TD><TD align="right">17499</TD><TD align="right">75681</TD></TR>
<TR><TD>Ext2 fs</TD><TD align="right">790</TD><TD align="right">4791</TD><TD align="right">7235</TD><TD align="right">11685</TD><TD align="right">63210</TD></TR>
<TR><TD>Xia fs</TD><TD align="right">934</TD><TD align="right">5402</TD><TD align="right">8400</TD><TD align="right">12912</TD><TD align="right">66997</TD></TR>
</TABLE>
<P>The results of the two first passes show that Linux benefits
from its asynchronous metadata I/O. In passes 1 and 2,
directories and files are created and BSD synchronously writes
inodes and directory entries. There is an anomaly, though: even
in asynchronous mode, the performance under BSD is poor. We
suspect that the asynchronous support under FreeBSD is not
fully implemented.
<P>In pass 3, the Linux and BSD times are very similar. This is
a big improvement over the same benchmark run six months ago.
While BSD used to outperform Linux by a factor of 3 in this
test, the addition of a file name cache in the VFS has fixed
this performance problem.
<P>In passes 4 and 5, Linux is faster than FreeBSD mainly
because it uses a unified buffer cache: the buffer cache can
grow when needed and use more memory than the fixed-size
buffer cache in FreeBSD.
Comparison of the Ext2fs and Xiafs results shows that the
optimizations included in Ext2fs are really useful: the
performance gain between Ext2fs and Xiafs is around 5-10%.
<H3>Conclusion</H3>
<P>The Second Extended File System is probably the most widely
used filesystem in the Linux community. It provides standard
Unix file semantics and advanced features. Moreover, thanks to
the optimizations included in the kernel code, it is robust and
offers excellent performance.
<P>Since Ext2fs has been designed with evolution in mind, it
contains hooks that can be used to add new features. Some
people are working on extensions to the current filesystem:
access control lists conforming to the Posix semantics
<A href="#posix6">[IEEE 1992]</A>, undelete, and on-the-fly
file compression.
<P>Ext2fs was first developed and integrated in the Linux
kernel and is now actively being ported to other operating
systems. An Ext2fs server running on top of the GNU Hurd has
been implemented. People are also working on an Ext2fs port in
the LITES server, running on top of the Mach microkernel
<A href="#mach:foundation">[Accetta <I>et al.</I> 1986]</A>, and
in the VSTa operating system. Last, but not least, Ext2fs is an
important part of the Masix operating system
<A href="#masix:osf">[Card <I>et al.</I> 1993]</A>,
currently under development by one of the authors.
<H3>Acknowledgments</H3>
<P>The Ext2fs kernel code and tools have been written mostly by
the authors of this paper. Some other people have also
contributed to the development of Ext2fs either by suggesting
new features or by sending patches. We want to thank these
contributors for their help.
<H3>References</H3>
<P><A name="mach:foundation">[Accetta <I>et al.</I> 1986]</A>
M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian, and
M. Young.
Mach: A New Kernel Foundation For UNIX Development.
In <I>Proceedings of the USENIX 1986 Summer Conference</I>, June 1986.
<P><A name="bach">[Bach 1986]</A>
M. Bach.
<I>The Design of the UNIX Operating System</I>.
Prentice Hall, 1986.
<P><A name="bsd:fsck">[Bina and Emrath 1989]</A>
E. Bina and P. Emrath.
A Faster fsck for BSD Unix.
In <I>Proceedings of the USENIX Winter Conference</I>, January 1989.
<P><A name="masix:osf">[Card <I>et al.</I> 1993]</A>
R. Card, E. Commelin, S. Dayras, and F. Mével.
The MASIX Multi-Server Operating System.
In <I>OSF Workshop on Microkernel Technology for Distributed Systems</I>,
June 1993.
<P><A name="posix6">[IEEE 1992]</A>
<I>SECURITY INTERFACE for the Portable Operating System Interface for
Computer Environments - Draft 13</I>.
Institute of Electrical and Electronics Engineers, Inc, 1992.
<P><A name="vnodes">[Kleiman 1986]</A>
S. Kleiman.
Vnodes: An Architecture for Multiple File System Types
in Sun UNIX.
In <I>Proceedings of the Summer USENIX Conference</I>, pages 260--269,
June 1986.
<P><A name="mckusick:ffs">[McKusick <I>et al.</I> 1984]</A>
M. McKusick, W. Joy, S. Leffler, and R. Fabry.
A Fast File System for UNIX.
<I>ACM Transactions on Computer Systems</I>, 2(3):181--197, August
1984.
<P><A name="lfs:unix">[Seltzer <I>et al.</I> 1993]</A>
M. Seltzer, K. Bostic, M. McKusick, and C. Staelin.
An Implementation of a Log-Structured File System for
UNIX.
In <I>Proceedings of the USENIX Winter Conference</I>, January 1993.
<P><A name="minix">[Tanenbaum 1987]</A>
A. Tanenbaum.
<I>Operating Systems: Design and Implementation</I>.
Prentice Hall, 1987.
<P>
<HR>
<P>Thanks to Michael Johnson for HTMLizing it (originally for use in
the <A HREF="http://khg.redhat.com/HyperNews/get/fs/fs.html"> Kernel
Hacker's Guide</A>).</P>
</BODY>
</HTML>

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<HTML>
<body bgcolor="#ffffff" text="#000000">
<font size="+1">Second Extended File System</font><BR>
by Dave Poirier (<a href="mailto:instinc@users.sf.net">instinc@users.sf.net</a>)<br>
<a href="http://savannah.gnu.org/projects/ext2-doc/">Project Page</a>
<p>
<center>View the information in:<br>
<a href="http://freesoftware.fsf.org/download/ext2-doc/ext2.dvi">dvi</a> -
<a href="ext2.html">html</a> -
<a href="http://freesoftware.fsf.org/download/ext2-doc/ext2.pdf">pdf</a> -
<a href="http://freesoftware.fsf.org/download/ext2-doc/ext2.ps">ps</a> -
<a href="http://freesoftware.fsf.org/download/ext2-doc/ext2.rtf">rtf</a>
<p><font size="-1">Last update: August 5th, 2002</font></center>
<p>
When I was working on my first ext2 driver implementation, I found myself short of
documentation on the subject. It wasn't so much that the information was
unavailable as that it was not available all in one place.
<p>
This project tries to fix that by bringing all the useful information together
in one easy-to-understand package. I try not to tie the documentation
to any particular operating system, so that it may be useful to the widest audience.
<p>
<b>Change Log</b>
<hr>
August 5th, 2002
<blockquote>
<li>Added a note to .i_blocks and .i_dtime</li>
</blockquote>
August 2nd, 2002
<blockquote>
<li>Updated the values of EXT2_S_IFLNK and EXT2_S_IFSOCK as noted by Jeremy Stanley of AccessData Inc</li>
<li>Added a note about the reserved inode entries</li>
</blockquote>
July 31st, 2002
<blockquote>
<li>Fixed the 0 and 1 definitions for the block and inode bitmaps.</li>
</blockquote>
June 16th, 2002
<blockquote>
<li>Cleared up the confusion about the location of the group descriptors in section 'Group Descriptor'</li>
</blockquote>
April 1st, 2002
<blockquote>
<li>Added the description of EXT2_INDEX_FL (Hash Indexed Directory)</li>
<li>Fixed many table layouts</li>
</blockquote>
March 31st, 2002
<blockquote>
<li>Added the Indexed Directory Format</li>
<li>Added .i_flags descriptions</li>
<li>Added a collaborator section and a credits appendix</li>
<li>Added some notes for compat/incompat features</li>
<li>Completed the inode chapter</li>
</blockquote>
March 25th, 2002
<blockquote>
<li>Added extended attributes</li>
</blockquote>
<hr>
References:
<li><a href="http://www.science.unitn.it/~fiorella/guidelinux/tlk/node95.html">Physical Layout</a></li>
<li><a href="http://e2fsprogs.sourceforge.net/">e2fsprogs (e2fsck)</a></li>
<li><a href="http://e2fsprogs.sourceforge.net/ext2intro.html">Design &amp; Implementation</a></li>
<li><a href="ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/">Journaling (ext3)</a></li>
<li><a href="http://kernelnewbies.org/~phillips/htree/">Hashed Directories</a></li>
<li><a href="http://ext2resize.sourceforge.net/">Filesystem Resizing</a></li>
<li><a href="http://acl.bestbits.at/">Extended Attributes &amp; Access Control Lists</a></li>
<li><a href="http://www.netspace.net.au/~reiter/e2compr/">Compression</a> (*)</li>
<BR><BR>
Implementations for:
<li><a href="http://uranus.it.swin.edu.au/~jn/linux/explore2fs.htm">Windows 95/98/NT/2000</a></li>
<li><a href="http://www.yipton.demon.co.uk/content.html#FSDEXT2">Windows 95</a> (*)</li>
<li><a href="ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/">DOS client</a> (*)</li>
<li><a href="http://perso.wanadoo.fr/matthieu.willm/ext2-os2/">OS/2</a> (*)</li>
<!-- invalid url .. <li><a href="ftp://ftp.barnet.ac.uk/pub/acorn/armlinux/iscafs/">RISC OS client</a></li> -->
<li><a href="http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/uuu/dimension/cell/fs/ext2/ext2.asm">Unununium</a></li>
<BR><BR>
(*) no longer actively developed/supported (as of March 2002)
<hr>
<center>graciously hosted by <a href="http://savannah.gnu.org">Savannah</a></center>
</body>
</HTML>

<!-- X-URL: http://step.polymtl.ca/~ldd/ext2fs/ext2fs_1.html -->
<!-- This HTML file has been created by texi2html 1.29
from ext2fs.texi on 3 August 1994 -->
<TITLE>Analysis of the Ext2fs structure - Introduction</TITLE>
<P>Go to the <A HREF="ext2fs_2.html">next</A> section.<P>
<P>
Copyright (C) 1994 Louis-Dominique Dubeau.
<P>
You may without charge, royalty or other payment, copy and distribute
copies of this work and derivative works of this work in source or
binary form provided that:
<P>
(1) you appropriately publish on each copy an appropriate copyright
notice; (2) faithfully reproduce all prior copyright notices included in
the original work (you may add your own copyright notice); and (3) agree
to indemnify and hold all prior authors, copyright holders and licensors
of the work harmless from and against all damages arising from the use
of the work.
<P>
You may distribute sources of derivative works of the work provided
that:
<P>
(1) (a) all source files of the original work that have been modified,
(b) all source files of the derivative work that contain any part of the
original work, and (c) all source files of the derivative work that are
necessary to compile, link and run the derivative work without
unresolved external calls and with the same functionality of the
original work ("Necessary Sources") carry a prominent notice explaining
the nature and date of the modification and/or creation. You are
encouraged to make the Necessary Sources available under this license in
order to further development and acceptance of the work.
<P>
EXCEPT AS OTHERWISE RESTRICTED BY LAW, THIS WORK IS PROVIDED WITHOUT ANY
EXPRESS OR IMPLIED WARRANTIES OF ANY KIND, INCLUDING BUT NOT LIMITED TO,
ANY IMPLIED WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE,
MERCHANTABILITY OR TITLE. EXCEPT AS OTHERWISE PROVIDED BY LAW, NO
AUTHOR, COPYRIGHT HOLDER OR LICENSOR SHALL BE LIABLE TO YOU FOR DAMAGES
OF ANY KIND, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
<P>
<H1><A NAME="SEC1" HREF="ext2fs_toc.html#SEC1">Introduction</A></H1>
<P>
This document has been written by Louis-Dominique Dubeau. It contains
an analysis of the structure of the Second Extended File System and is
based on a study of the Linux kernel source files. This document does
not contain specifications written by the Ext2fs development team.
<P>
Ext2fs was designed by
Rémy Card <A NAME="FOOT1" HREF="ext2fs_foot.html#FOOT1">(1)</A>
as an extensible and powerful file system for Linux. It is also the most
successful file system so far in the Linux community.
<P>
The first Linux file system was Minixfs, a file system originally
developed for the Minix operating system. This file system had many
disadvantages, among them a 64MB limit on partitions, a 14-character
limit on file names, and no built-in extensibility.
<P>
To overcome those problems,
Rémy Card
wrote extfs. This file system was mostly based on the original Minixfs
code and implementation; however, it removed the 64MB size limit on
partitions and raised the file name length limit to 255 characters.
<P>
In his quest for the perfect file system,
Rémy
was still unsatisfied, so he decided to write a brand new file system:
ext2fs. This file system not only has the advantages of extfs but also
provides better space allocation management, special flags for file
management, and access control lists, and it is extensible.
<P>
Will
Rémy
someday come up with ext3fs? Who knows? In the meantime, however, ext2fs is
<STRONG>the</STRONG> de facto standard Linux file system. This document describes
the physical layout of an ext2 file system on disk and the management
policies that every ext2 file system manager should implement. The
information in this document is accurate as of version 0.5 of ext2fs
(Linux kernel version 1.0). Information about access control lists
is not included because no implementation of ext2fs enforces them anyway.
<P>
<P>Go to the <A HREF="ext2fs_2.html">next</A> section.<P>

<!-- X-URL: http://step.polymtl.ca/~ldd/ext2fs/ext2fs_10.html -->
<!-- This HTML file has been created by texi2html 1.29
from ext2fs.texi on 3 August 1994 -->
<TITLE>Analysis of the Ext2fs structure - Error Handling</TITLE>
<P>Go to the <A HREF="ext2fs_9.html">previous</A>, <A HREF="ext2fs_11.html">next</A> section.<P>
<A NAME="IDX48"></A>
<A NAME="IDX49"></A>
<H1><A NAME="SEC10" HREF="ext2fs_toc.html#SEC10">Error Handling</A></H1>
<P>
This chapter describes how a standard ext2 file system must handle
errors. The superblock contains two parameters controlling the way
errors are handled. See section <A HREF="ext2fs_4.html#SEC4">Superblock</A>.
<P>
The first of these is the <CODE>s_mount_opt</CODE> member of the superblock
structure in memory. Its value is computed from the options specified
when the fs is mounted. Its error handling related values are:
<P>
<DL COMPACT>
<DT><CODE>EXT2_MOUNT_ERRORS_CONT</CODE>
<DD>continue even if an error occurs.
<P>
<DT><CODE>EXT2_MOUNT_ERRORS_RO</CODE>
<DD>remount the file system read only.
<P>
<DT><CODE>EXT2_MOUNT_ERRORS_PANIC</CODE>
<DD>the kernel panics on error.
</DL>
<P>
The second of these is the <CODE>s_errors</CODE> member of the superblock
structure on disk. It may take one of the following values:
<P>
<DL COMPACT>
<DT><CODE>EXT2_ERRORS_CONTINUE</CODE>
<DD>continue even if an error occurs.
<P>
<DT><CODE>EXT2_ERRORS_RO</CODE>
<DD>remount the file system read only.
<P>
<DT><CODE>EXT2_ERRORS_PANIC</CODE>
<DD>the kernel panics on error.
<P>
<DT><CODE>EXT2_ERRORS_DEFAULT</CODE>
<DD>use the default behavior (as of 0.5a <CODE>EXT2_ERRORS_CONTINUE</CODE>).
</DL>
<P>
<CODE>s_mount_opt</CODE> takes precedence over <CODE>s_errors</CODE>.
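<P>
The precedence rule can be sketched in C. The enum values below are illustrative placeholders, not the kernel's actual constants; only the three behaviors themselves come from the lists above.

```c
/* Illustrative error-behavior codes; 0 marks "not specified". */
enum ext2_errors_behavior {
    ERRORS_UNSPECIFIED = 0,
    ERRORS_CONTINUE    = 1,   /* continue even if an error occurs */
    ERRORS_RO          = 2,   /* remount the file system read only */
    ERRORS_PANIC       = 3    /* the kernel panics on error */
};

/* s_mount_opt has precedence over s_errors; the default behavior
   (EXT2_ERRORS_DEFAULT, as of 0.5a) falls back to "continue". */
static enum ext2_errors_behavior
ext2_error_behavior(enum ext2_errors_behavior mount_opt,
                    enum ext2_errors_behavior s_errors)
{
    if (mount_opt != ERRORS_UNSPECIFIED)
        return mount_opt;
    if (s_errors != ERRORS_UNSPECIFIED)
        return s_errors;
    return ERRORS_CONTINUE;
}
```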
<P>Go to the <A HREF="ext2fs_9.html">previous</A>, <A HREF="ext2fs_11.html">next</A> section.<P>

<!-- X-URL: http://step.polymtl.ca/~ldd/ext2fs/ext2fs_11.html -->
<!-- This HTML file has been created by texi2html 1.29
from ext2fs.texi on 3 August 1994 -->
<TITLE>Analysis of the Ext2fs structure - Formulae</TITLE>
<P>Go to the <A HREF="ext2fs_10.html">previous</A>, <A HREF="ext2fs_12.html">next</A> section.<P>
<H1><A NAME="SEC11" HREF="ext2fs_toc.html#SEC11">Formulae</A></H1>
<P>
Here are a couple of formulae usually used in ext2fs managers.
<P>
The block number corresponding to a given file-relative byte offset:
<P>
block = offset / s_blocksize
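<P>
As a sketch, the formula translates directly into a pair of helpers (the function names are illustrative, not part of any real ext2fs manager):

```c
/* block = offset / s_blocksize */
static unsigned long block_of_offset(unsigned long offset,
                                     unsigned long s_blocksize)
{
    return offset / s_blocksize;
}

/* The remainder gives the position inside that block. */
static unsigned long offset_in_block(unsigned long offset,
                                     unsigned long s_blocksize)
{
    return offset % s_blocksize;
}
```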
<P>
<P>Go to the <A HREF="ext2fs_10.html">previous</A>, <A HREF="ext2fs_12.html">next</A> section.<P>

<!-- X-URL: http://step.polymtl.ca/~ldd/ext2fs/ext2fs_12.html -->
<!-- This HTML file has been created by texi2html 1.29
from ext2fs.texi on 3 August 1994 -->
<TITLE>Analysis of the Ext2fs structure - Invariants</TITLE>
<P>Go to the <A HREF="ext2fs_11.html">previous</A>, <A HREF="ext2fs_13.html">next</A> section.<P>
<H1><A NAME="SEC12" HREF="ext2fs_toc.html#SEC12">Invariants</A></H1>
<P>
Here we define a set of invariant propositions. These propositions may
be momentarily false during file manipulations by the ext2 file system
manager. However, file invariants should always hold for the set
of files not currently being manipulated by the file system manager,
and file system invariants should always hold when the file system
manager is not manipulating the file system.
<P>
<P>Go to the <A HREF="ext2fs_11.html">previous</A>, <A HREF="ext2fs_13.html">next</A> section.<P>

<!-- X-URL: http://step.polymtl.ca/~ldd/ext2fs/ext2fs_13.html -->
<!-- This HTML file has been created by texi2html 1.29
from ext2fs.texi on 3 August 1994 -->
<TITLE>Analysis of the Ext2fs structure - File Invariants</TITLE>
<P>Go to the <A HREF="ext2fs_12.html">previous</A>, <A HREF="ext2fs_14.html">next</A> section.<P>
<H2><A NAME="SEC13" HREF="ext2fs_toc.html#SEC13">File Invariants</A></H2>
<P>
<P>Go to the <A HREF="ext2fs_12.html">previous</A>, <A HREF="ext2fs_14.html">next</A> section.<P>

<!-- X-URL: http://step.polymtl.ca/~ldd/ext2fs/ext2fs_14.html -->
<!-- This HTML file has been created by texi2html 1.29
from ext2fs.texi on 3 August 1994 -->
<TITLE>Analysis of the Ext2fs structure - File System Invariants</TITLE>
<P>Go to the <A HREF="ext2fs_13.html">previous</A>, <A HREF="ext2fs_15.html">next</A> section.<P>
<H2><A NAME="SEC14" HREF="ext2fs_toc.html#SEC14">File System Invariants</A></H2>
<P>Go to the <A HREF="ext2fs_13.html">previous</A>, <A HREF="ext2fs_15.html">next</A> section.<P>

<!-- X-URL: http://step.polymtl.ca/~ldd/ext2fs/ext2fs_15.html -->
<!-- This HTML file has been created by texi2html 1.29
from ext2fs.texi on 3 August 1994 -->
<TITLE>Analysis of the Ext2fs structure - References</TITLE>
<P>Go to the <A HREF="ext2fs_14.html">previous</A>, <A HREF="ext2fs_16.html">next</A> section.<P>
<H1><A NAME="SEC15" HREF="ext2fs_toc.html#SEC15">References</A></H1>
<P>
All the sources used to write this document are cited here: books,
source code, man pages, etc.
<A NAME="FOOT4" HREF="ext2fs_foot.html#FOOT4">(4)</A>
<P>
Card, Rémy 1993. <EM>Implémentation du système de fichiers ext2 dans Linux</EM>,
Rapport MASI, Institut Blaise Pascal, Paris, France.
<P>
Card, Rémy, et al. 1994. The ext2fs sources in Linux kernel. Available
by ftp at nic.funet.fi.
<P>
Card, Rémy, Ts'o, Theodore and Tweedie, Stephen. 1994. <EM>Linux File
Systems</EM>. Available at
ftp://ftp.ibp.fr/pub2/linux/packages/ext2fs/ext2-1.eps.gz
<P>
Torvalds, Linus, et al. 1994. The Linux 1.0 kernel sources. Available
by ftp at nic.funet.fi.
<P>
Ts'o, Theodore, and Card, Rémy. 1994. The e2fsprogs-0.5a sources. Available
by ftp at sunsite.unc.edu
<P>
<P>Go to the <A HREF="ext2fs_14.html">previous</A>, <A HREF="ext2fs_16.html">next</A> section.<P>

<!-- X-URL: http://step.polymtl.ca/~ldd/ext2fs/ext2fs_16.html -->
<!-- This HTML file has been created by texi2html 1.29
from ext2fs.texi on 3 August 1994 -->
<TITLE>Analysis of the Ext2fs structure - Concept Index</TITLE>
<P>Go to the <A HREF="ext2fs_15.html">previous</A> section.<P>
<H1><A NAME="SEC16" HREF="ext2fs_toc.html#SEC16">Concept Index</A></H1>
<P>
<DIR>
<H2>a</H2>
<LI><A HREF="ext2fs_8.html#IDX44">Access path</A>
<LI><A HREF="ext2fs_7.html#IDX39">ACL inode</A>
<H2>b</H2>
<LI><A HREF="ext2fs_7.html#IDX36">Bad blocks list</A>
<LI><A HREF="ext2fs_4.html#IDX20">Bitmap cache</A>
<LI><A HREF="ext2fs_6.html#IDX22">Bitmaps, in general</A>
<LI><A HREF="ext2fs_6.html#IDX23">Block allocation and bitmaps</A>
<LI><A HREF="ext2fs_6.html#IDX24">Block bitmap</A>
<LI><A HREF="ext2fs_2.html#IDX1">Blocks, in general</A>
<LI><A HREF="ext2fs_7.html#IDX40">Boot loader inode</A>
<H2>c</H2>
<LI><A HREF="ext2fs_4.html#IDX21">Caching of bitmaps</A>
<LI><A HREF="ext2fs_3.html#IDX17">Content of a group</A>
<LI><A HREF="ext2fs_7.html#IDX33">Content of an inode</A>
<LI><A HREF="ext2fs_8.html#IDX47">Current directory</A>
<H2>d</H2>
<LI><A HREF="ext2fs_2.html#IDX3">Definition of a block</A>
<LI><A HREF="ext2fs_2.html#IDX11">Definition of a fragment</A>
<LI><A HREF="ext2fs_8.html#IDX43">Directories, in general</A>
<LI><A HREF="ext2fs_8.html#IDX45">Directory entries</A>
<LI><A HREF="ext2fs_3.html#IDX15">Duplication of information</A>
<H2>e</H2>
<LI><A HREF="ext2fs_10.html#IDX49">Error handling</A>
<LI><A HREF="ext2fs_10.html#IDX48">Errors, in general</A>
<H2>f</H2>
<LI><A HREF="ext2fs_7.html#IDX42">First normal inode</A>
<LI><A HREF="ext2fs_2.html#IDX13">Fragment size</A>
<LI><A HREF="ext2fs_2.html#IDX2">Fragments, in general</A>
<H2>g</H2>
<LI><A HREF="ext2fs_3.html#IDX14">Groups, in general</A>
<H2>i</H2>
<LI><A HREF="ext2fs_3.html#IDX16">Information duplication</A>
<LI><A HREF="ext2fs_6.html#IDX25">Inode allocation and bitmaps</A>
<LI><A HREF="ext2fs_6.html#IDX26">Inode bitmap</A>
<LI><A HREF="ext2fs_7.html#IDX32">Inode content</A>
<LI><A HREF="ext2fs_7.html#IDX28">Inode layout</A>
<LI><A HREF="ext2fs_7.html#IDX31">Inode structure</A>
<LI><A HREF="ext2fs_7.html#IDX27">Inodes, in general</A>
<H2>l</H2>
<LI><A HREF="ext2fs_3.html#IDX18">Layout of a group</A>
<LI><A HREF="ext2fs_7.html#IDX29">Layout of a inode</A>
<LI><A HREF="ext2fs_7.html#IDX35">List of bad blocks</A>
<LI><A HREF="ext2fs_2.html#IDX10">Logical addresses range</A>
<LI><A HREF="ext2fs_2.html#IDX8">Logical block size</A>
<LI><A HREF="ext2fs_2.html#IDX6">Logical versus physical addresses</A>
<LI><A HREF="ext2fs_2.html#IDX5">Logical versus physical blocks</A>
<H2>p</H2>
<LI><A HREF="ext2fs_8.html#IDX46">Parent directory</A>
<LI><A HREF="ext2fs_2.html#IDX7">Physical blocks</A>
<H2>r</H2>
<LI><A HREF="ext2fs_2.html#IDX4">Reserved blocks</A>
<LI><A HREF="ext2fs_7.html#IDX38">Root directory</A>
<LI><A HREF="ext2fs_7.html#IDX37">Root inode</A>
<H2>s</H2>
<LI><A HREF="ext2fs_2.html#IDX12">Size of a fragment</A>
<LI><A HREF="ext2fs_2.html#IDX9">Size of logical blocks</A>
<LI><A HREF="ext2fs_7.html#IDX34">Special inodes</A>
<LI><A HREF="ext2fs_7.html#IDX30">Structure of an inode</A>
<H2>t</H2>
<LI><A HREF="ext2fs_4.html#IDX19">Times</A>
<H2>u</H2>
<LI><A HREF="ext2fs_7.html#IDX41">Undelete directory inode</A>
</DIR>
<P>
<P>Go to the <A HREF="ext2fs_15.html">previous</A> section.<P>

<!-- X-URL: http://step.polymtl.ca/~ldd/ext2fs/ext2fs_2.html -->
<!-- This HTML file has been created by texi2html 1.29
from ext2fs.texi on 3 August 1994 -->
<TITLE>Analysis of the Ext2fs structure - Blocks and Fragments</TITLE>
<P>Go to the <A HREF="ext2fs_1.html">previous</A>, <A HREF="ext2fs_3.html">next</A> section.<P>
<A NAME="IDX1"></A>
<A NAME="IDX2"></A>
<H1><A NAME="SEC2" HREF="ext2fs_toc.html#SEC2">Blocks and Fragments</A></H1>
<A NAME="IDX3"></A>
<P>
Blocks are the basic building blocks of a file system. The file system
manager's requests to read or write the disk are always translated into
requests to read or write an integral number of blocks.
<A NAME="IDX4"></A>
<P>
Some blocks on the file system are reserved for the exclusive use of the
superuser. This information is recorded in the <CODE>s_r_blocks_count</CODE>
member of the superblock structure. See section <A HREF="ext2fs_4.html#SEC4">Superblock</A>.
Whenever the total number of free blocks becomes equal to the number of
reserved blocks, normal users can no longer allocate blocks for their own
use; only the superuser may allocate new blocks. Without this provision
for reserved blocks, filling up the file system could make the computer
unbootable: it would crash as soon as a startup task tried to allocate a
block. Reserved blocks guarantee a minimum of space for booting and for
letting the superuser clean up the disk.
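<P>
The allocation policy just described boils down to a single check. This is an illustrative helper, not kernel code; the parameter names follow the superblock fields mentioned above.

```c
/* Ordinary users may allocate only while the free-block count stays
   above the reserve; the superuser may dip into the reserved blocks. */
static int ext2_may_allocate(unsigned long s_free_blocks_count,
                             unsigned long s_r_blocks_count,
                             int is_superuser)
{
    if (is_superuser)
        return s_free_blocks_count > 0;
    return s_free_blocks_count > s_r_blocks_count;
}
```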
<A NAME="IDX5"></A>
<A NAME="IDX6"></A>
<A NAME="IDX7"></A>
<A NAME="IDX8"></A>
<A NAME="IDX9"></A>
<P>
This is all very simple. However, computer scientists like to
complicate things a bit. There are in fact two kinds of blocks: logical
blocks and physical blocks. The addressing scheme and size of these two
kinds of blocks may vary. When a request is made to
manipulate the range <SAMP>`[a,b]'</SAMP> of some file, this range is first
converted by the higher layers of the file system into a request to
manipulate an integral number of logical blocks: <SAMP>`a'</SAMP> is rounded
down to a logical block boundary and <SAMP>`b'</SAMP> is rounded up to a
logical block boundary. Then, this range of logical blocks is converted
by the lower layers of the file system into a request to manipulate an
integral number of physical blocks. The logical block size must be the
physical block size multiplied by a power of two <A NAME="FOOT2" HREF="ext2fs_foot.html#FOOT2">(2)</A>, so when going from logical to physical addressing
we just have to multiply the address by this power of two.
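<P>
The two conversions described above can be sketched as follows. All names are illustrative; <CODE>log_ratio</CODE> stands for the power of two relating the logical and physical block sizes.

```c
/* Round a byte range [a,b] to whole logical blocks. */
static unsigned long first_logical_block(unsigned long a,
                                         unsigned long block_size)
{
    return a / block_size;                     /* round `a' down */
}

static unsigned long last_logical_block(unsigned long b,
                                        unsigned long block_size)
{
    return (b + block_size - 1) / block_size;  /* round `b' up */
}

/* Logical-to-physical addressing: multiply by the power of two. */
static unsigned long logical_to_physical(unsigned long logical,
                                         unsigned int log_ratio)
{
    return logical << log_ratio;
}
```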
<A NAME="IDX10"></A>
<P>
The logical addresses of the file system go from zero up to the total
number of blocks minus one. Block zero is the boot block and is usually
only accessed during special operations.
<P>
Now, the problem with blocks is that if a file is not an integral
number of blocks long, space at the end of its last block is wasted.
On average, one half block is wasted per file. On most file systems this
means a lot of wasted space.
<A NAME="IDX11"></A>
<A NAME="IDX12"></A>
<A NAME="IDX13"></A>
<P>
To circumvent this inconvenience, the file system uses fragments. The
fragment size must be the physical block size multiplied by a power of
two <A NAME="FOOT3" HREF="ext2fs_foot.html#FOOT3">(3)</A>. A file is therefore a sequence of
blocks followed by a small sequence of consecutive fragments. When a file
has enough ending fragments to fill a block, those fragments are grouped
into a block. When a file is shortened, the last block may be broken into
many contiguous fragments.
<P>
The general relationship between sizes is: physical block size &#60;=
fragment size &#60;= logical block size.
<P>
<P>Go to the <A HREF="ext2fs_1.html">previous</A>, <A HREF="ext2fs_3.html">next</A> section.<P>

<!-- X-URL: http://step.polymtl.ca/~ldd/ext2fs/ext2fs_3.html -->
<!-- This HTML file has been created by texi2html 1.29
from ext2fs.texi on 3 August 1994 -->
<TITLE>Analysis of the Ext2fs structure - Groups</TITLE>
<P>Go to the <A HREF="ext2fs_2.html">previous</A>, <A HREF="ext2fs_4.html">next</A> section.<P>
<A NAME="IDX14"></A>
<H1><A NAME="SEC3" HREF="ext2fs_toc.html#SEC3">Groups</A></H1>
<A NAME="IDX15"></A>
<A NAME="IDX16"></A>
<P>
The blocks on disk are divided into groups. Each of these groups
duplicates critical information of the file system. Moreover, the
presence of block groups on disk allows the use of efficient disk
allocation algorithms.
<A NAME="IDX17"></A>
<A NAME="IDX18"></A>
<P>
Each group contains, in this order:
<P>
<UL>
<LI>the superblock. See section <A HREF="ext2fs_4.html#SEC4">Superblock</A>
<P>
<LI>the group descriptors. See section <A HREF="ext2fs_5.html#SEC5">Group Descriptors</A>
<P>
<LI>the block bitmap of the group. See section <A HREF="ext2fs_6.html#SEC6">Bitmaps</A>
<P>
<LI>the inode bitmap of the group.
<P>
<LI>the inode table of the group. See section <A HREF="ext2fs_7.html#SEC7">Inodes</A>
<P>
<LI>the data blocks in the group. See section <A HREF="ext2fs_2.html#SEC2">Blocks and Fragments</A>
</UL>
<P>
The superblock and group descriptors of each group must carry the same
values on disk.
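<P>
The group geometry follows from superblock fields described in the next section. The helpers below are a sketch, assuming the field names <CODE>s_blocks_count</CODE>, <CODE>s_first_data_block</CODE>, and <CODE>s_blocks_per_group</CODE>:

```c
/* Number of groups: round up, since a final partial group still
   needs its own superblock copy, descriptors, bitmaps, and inode
   table. */
static unsigned long ext2_group_count(unsigned long s_blocks_count,
                                      unsigned long s_first_data_block,
                                      unsigned long s_blocks_per_group)
{
    return (s_blocks_count - s_first_data_block + s_blocks_per_group - 1)
           / s_blocks_per_group;
}

/* Group to which a given block belongs. */
static unsigned long ext2_block_to_group(unsigned long block,
                                         unsigned long s_first_data_block,
                                         unsigned long s_blocks_per_group)
{
    return (block - s_first_data_block) / s_blocks_per_group;
}
```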
<P>Go to the <A HREF="ext2fs_2.html">previous</A>, <A HREF="ext2fs_4.html">next</A> section.<P>

<!-- X-URL: http://step.polymtl.ca/~ldd/ext2fs/ext2fs_4.html -->
<!-- This HTML file has been created by texi2html 1.29
from ext2fs.texi on 3 August 1994 -->
<TITLE>Analysis of the Ext2fs structure - Superblock</TITLE>
<P>Go to the <A HREF="ext2fs_3.html">previous</A>, <A HREF="ext2fs_5.html">next</A> section.<P>
<H1><A NAME="SEC4" HREF="ext2fs_toc.html#SEC4">Superblock</A></H1>
<P>
In this section, the layout of a superblock is described. Here is the
official structure of an ext2fs superblock [include/linux/ext2_fs.h]:
<P>
<PRE>
struct ext2_super_block {
unsigned long s_inodes_count;
unsigned long s_blocks_count;
unsigned long s_r_blocks_count;
unsigned long s_free_blocks_count;
unsigned long s_free_inodes_count;
unsigned long s_first_data_block;
unsigned long s_log_block_size;
long s_log_frag_size;
unsigned long s_blocks_per_group;
unsigned long s_frags_per_group;
unsigned long s_inodes_per_group;
unsigned long s_mtime;
unsigned long s_wtime;
unsigned short s_mnt_count;
short s_max_mnt_count;
unsigned short s_magic;
unsigned short s_state;
unsigned short s_errors;
unsigned short s_pad;
unsigned long s_lastcheck;
unsigned long s_checkinterval;
unsigned long s_reserved[238];
};
</PRE>
<P>
<DL COMPACT>
<DT><CODE>s_inodes_count</CODE>
<DD>the total number of inodes on the fs.
<P>
<DT><CODE>s_blocks_count</CODE>
<DD>the total number of blocks on the fs.
<P>
<DT><CODE>s_r_blocks_count</CODE>
<DD>the total number of blocks reserved for the exclusive use of the
superuser.
<P>
<DT><CODE>s_free_blocks_count</CODE>
<DD>the total number of free blocks on the fs.
<P>
<DT><CODE>s_free_inodes_count</CODE>
<DD>the total number of free inodes on the fs.
<P>
<DT><CODE>s_first_data_block</CODE>
<DD>the position on the fs of the first data block. Usually, this is block
number 1 for file systems with 1024-byte blocks and block number 0 for
other file systems.
<P>
<DT><CODE>s_log_block_size</CODE>
<DD>used to compute the logical block size in bytes. The logical block size
is in fact <CODE>1024 &#60;&#60; s_log_block_size</CODE>.
<P>
<DT><CODE>s_log_frag_size</CODE>
<DD>used to compute the logical fragment size. The logical fragment size is
in fact <CODE>1024 &#60;&#60; s_log_frag_size</CODE> if <CODE>s_log_frag_size</CODE> is positive
and <CODE>1024 &#62;&#62; -s_log_frag_size</CODE> if <CODE>s_log_frag_size</CODE> is negative.
<P>
<DT><CODE>s_blocks_per_group</CODE>
<DD>the total number of blocks contained in a group.
<P>
<DT><CODE>s_frags_per_group</CODE>
<DD>the total number of fragments contained in a group.
<P>
<DT><CODE>s_inodes_per_group</CODE>
<DD>the total number of inodes contained in a group.
<P>
<DT><CODE>s_mtime</CODE>
<DD>the time at which the last mount of the fs was performed.
<P>
<DT><CODE>s_wtime</CODE>
<DD>the time at which the last write of the superblock on the fs was performed.
<P>
<DT><CODE>s_mnt_count</CODE>
<DD>the number of times the fs has been mounted in read-write mode without
having been checked.
<P>
<DT><CODE>s_max_mnt_count</CODE>
<DD>the maximum number of times the fs may be mounted in read-write mode
before a check must be done.
<P>
<DT><CODE>s_magic</CODE>
<DD>a magic number that permits the identification of the file system. It is
<CODE>0xEF53</CODE> for a normal ext2fs and <CODE>0xEF51</CODE> for versions of
ext2fs prior to 0.2b.
<P>
<DT><CODE>s_state</CODE>
<DD>the state of the file system. It is an OR'ed combination of
EXT2_VALID_FS (0x0001), which means the fs was unmounted cleanly, and
EXT2_ERROR_FS (0x0002), which means the kernel code has detected errors.
<P>
<DT><CODE>s_errors</CODE>
<DD>indicates what operation to perform when an error occurs. See section <A HREF="ext2fs_10.html#SEC10">Error Handling</A>.
<P>
<DT><CODE>s_pad</CODE>
<DD>unused.
<P>
<DT><CODE>s_lastcheck</CODE>
<DD>the time of the last check performed on the fs.
<P>
<DT><CODE>s_checkinterval</CODE>
<DD>the maximum possible time between checks on the fs.
<P>
<DT><CODE>s_reserved</CODE>
<DD>unused.
</DL>
<A NAME="IDX19"></A>
<P>
Times are measured in seconds since 00:00:00 GMT, January 1, 1970.
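<P>
A manager reading the structure above typically starts by validating the magic number and decoding the size fields. The helpers below are a sketch built only on the field semantics just described, not actual kernel code:

```c
#define EXT2_MAGIC_CURRENT 0xEF53   /* normal ext2fs */
#define EXT2_MAGIC_OLD     0xEF51   /* ext2fs prior to 0.2b */

/* Logical block size in bytes: 1024 << s_log_block_size. */
static unsigned long ext2_block_size(unsigned long s_log_block_size)
{
    return 1024UL << s_log_block_size;
}

/* Fragment size: shifted left when positive, right when negative. */
static unsigned long ext2_frag_size(long s_log_frag_size)
{
    if (s_log_frag_size >= 0)
        return 1024UL << s_log_frag_size;
    return 1024UL >> -s_log_frag_size;
}

/* A manager should refuse to mount when the magic does not match. */
static int ext2_magic_valid(unsigned short s_magic)
{
    return s_magic == EXT2_MAGIC_CURRENT || s_magic == EXT2_MAGIC_OLD;
}
```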
<P>
Once the superblock is read into memory, the ext2fs kernel code computes
some additional information and keeps it in another structure. This structure
has the following layout:
<P>
<PRE>
struct ext2_sb_info {
unsigned long s_frag_size;
unsigned long s_frags_per_block;
unsigned long s_inodes_per_block;
unsigned long s_frags_per_group;
unsigned long s_blocks_per_group;
unsigned long s_inodes_per_group;
unsigned long s_itb_per_group;
unsigned long s_desc_per_block;
unsigned long s_groups_count;
struct buffer_head * s_sbh;
struct ext2_super_block * s_es;
struct buffer_head * s_group_desc[EXT2_MAX_GROUP_DESC];
unsigned short s_loaded_inode_bitmaps;
unsigned short s_loaded_block_bitmaps;
unsigned long s_inode_bitmap_number[EXT2_MAX_GROUP_LOADED];
struct buffer_head * s_inode_bitmap[EXT2_MAX_GROUP_LOADED];
unsigned long s_block_bitmap_number[EXT2_MAX_GROUP_LOADED];
struct buffer_head * s_block_bitmap[EXT2_MAX_GROUP_LOADED];
int s_rename_lock;
struct wait_queue * s_rename_wait;
unsigned long s_mount_opt;
unsigned short s_mount_state;
};
</PRE>
<P>
<DL COMPACT>
<DT><CODE>s_frag_size</CODE>
<DD>fragment size in bytes.
<P>
<DT><CODE>s_frags_per_block</CODE>
<DD>number of fragments in a block.
<P>
<DT><CODE>s_inodes_per_block</CODE>
<DD>number of inodes in a block of the inode table.
<P>
<DT><CODE>s_frags_per_group</CODE>
<DD>number of fragments in a group.
<P>
<DT><CODE>s_blocks_per_group</CODE>
<DD>number of blocks in a group.
<P>
<DT><CODE>s_inodes_per_group</CODE>
<DD>number of inodes in a group.
<P>
<DT><CODE>s_itb_per_group</CODE>
<DD>number of inode table blocks per group.
<P>
<DT><CODE>s_desc_per_block</CODE>
<DD>number of group descriptors per block.
<P>
<DT><CODE>s_groups_count</CODE>
<DD>number of groups.
<P>
<DT><CODE>s_sbh</CODE>
<DD>the buffer containing the disk superblock in memory.
<P>
<DT><CODE>s_es</CODE>
<DD>pointer to the superblock in the buffer.
<P>
<DT><CODE>s_group_desc</CODE>
<DD>pointers to the buffers containing the group descriptors.
<P>
<DT><CODE>s_loaded_inode_bitmaps</CODE>
<DD>number of inodes bitmap cache entries used.
<P>
<DT><CODE>s_loaded_block_bitmaps</CODE>
<DD>number of blocks bitmap cache entries used.
<P>
<DT><CODE>s_inode_bitmap_number</CODE>
<DD>indicates to which group each of the cached inode bitmaps belongs.
<P>
<DT><CODE>s_inode_bitmap</CODE>
<DD>inode bitmap cache.
<P>
<DT><CODE>s_block_bitmap_number</CODE>
<DD>indicates to which group each of the cached block bitmaps belongs.
<P>
<DT><CODE>s_block_bitmap</CODE>
<DD>block bitmap cache.
<P>
<DT><CODE>s_rename_lock</CODE>
<DD>lock used to avoid two simultaneous rename operations on a fs.
<P>
<DT><CODE>s_rename_wait</CODE>
<DD>wait queue used to wait for the completion of a rename operation in progress.
<P>
<DT><CODE>s_mount_opt</CODE>
<DD>the mounting options specified by the administrator.
<P>
<DT><CODE>s_mount_state</CODE>
<DD>the file system state, copied from the superblock at mount time.
</DL>
<P>
Most of those values are computed from the superblock on disk.
<A NAME="IDX20"></A>
<A NAME="IDX21"></A>
<P>
The Linux ext2fs manager caches access to the inode and block
bitmaps. This cache is a list of buffers ordered from the most recently
used to the least recently used buffer. Managers should use the same kind
of bitmap caching, or another similar method, to improve access time to
disk.
<P>
<P>Go to the <A HREF="ext2fs_3.html">previous</A>, <A HREF="ext2fs_5.html">next</A> section.<P>

<!-- X-URL: http://step.polymtl.ca/~ldd/ext2fs/ext2fs_5.html -->
<!-- This HTML file has been created by texi2html 1.29
from ext2fs.texi on 3 August 1994 -->
<TITLE>Analysis of the Ext2fs structure - Group Descriptors</TITLE>
<P>Go to the <A HREF="ext2fs_4.html">previous</A>, <A HREF="ext2fs_6.html">next</A> section.<P>
<H1><A NAME="SEC5" HREF="ext2fs_toc.html#SEC5">Group Descriptors</A></H1>
<P>
On disk, the group descriptors immediately follow the superblock and
each descriptor has the following layout:
<P>
<PRE>
struct ext2_group_desc
{
unsigned long bg_block_bitmap;
unsigned long bg_inode_bitmap;
unsigned long bg_inode_table;
unsigned short bg_free_blocks_count;
unsigned short bg_free_inodes_count;
unsigned short bg_used_dirs_count;
unsigned short bg_pad;
unsigned long bg_reserved[3];
};
</PRE>
<P>
<DL COMPACT>
<DT><CODE>bg_block_bitmap</CODE>
<DD>points to the blocks bitmap block for the group.
<P>
<DT><CODE>bg_inode_bitmap</CODE>
<DD>points to the inodes bitmap block for the group.
<P>
<DT><CODE>bg_inode_table</CODE>
<DD>points to the inodes table first block.
<P>
<DT><CODE>bg_free_blocks_count</CODE>
<DD>number of free blocks in the group.
<P>
<DT><CODE>bg_free_inodes_count</CODE>
<DD>number of free inodes in the group.
<P>
<DT><CODE>bg_used_dirs_count</CODE>
<DD>number of inodes allocated to directories in the group.
<P>
<DT><CODE>bg_pad</CODE>
<DD>padding.
</DL>
<P>
The information in a group descriptor pertains only to the group it is
actually describing.
<P>
<P>Go to the <A HREF="ext2fs_4.html">previous</A>, <A HREF="ext2fs_6.html">next</A> section.<P>

<!-- X-URL: http://step.polymtl.ca/~ldd/ext2fs/ext2fs_6.html -->
<!-- This HTML file has been created by texi2html 1.29
from ext2fs.texi on 3 August 1994 -->
<TITLE>Analysis of the Ext2fs structure - Bitmaps</TITLE>
<P>Go to the <A HREF="ext2fs_5.html">previous</A>, <A HREF="ext2fs_7.html">next</A> section.<P>
<A NAME="IDX22"></A>
<H1><A NAME="SEC6" HREF="ext2fs_toc.html#SEC6">Bitmaps</A></H1>
<P>
The ext2 file system uses bitmaps to keep track of allocated blocks
and inodes.
<A NAME="IDX23"></A>
<A NAME="IDX24"></A>
<P>
The blocks bitmap of each group refers to blocks ranging from the first
block in the group to the last block in the group. To access the bit of
a precise block, we first have to look for the group the block belongs
to and then look for the bit of this block in the blocks bitmap
contained in the group. It is very important to note that the blocks
bitmap in fact refers to the smallest allocation unit supported by the
file system: fragments. Since the block size is always a multiple of the
fragment size, when the file system manager allocates a block, it
actually allocates several fragments. This use of the blocks bitmap
permits the file system manager to allocate and deallocate space on a
fragment basis.
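<P>
As a sketch of the lookup described above, a manager can locate the
bitmap bit of a block from its group. This example assumes two
superblock-derived values, the number of the first data block and the
number of blocks per group, passed in as parameters:

```c
/* Locate the bitmap position of a block: which group it belongs to,
   and which bit inside that group's blocks bitmap describes it.
   first_data_block and blocks_per_group come from the superblock. */
struct blk_pos { unsigned long group; unsigned long bit; };

struct blk_pos block_to_bitmap_bit(unsigned long block,
                                   unsigned long first_data_block,
                                   unsigned long blocks_per_group)
{
    struct blk_pos p;
    p.group = (block - first_data_block) / blocks_per_group;
    p.bit   = (block - first_data_block) % blocks_per_group;
    return p;
}
```

The same arithmetic, with inodes per group, locates an inode's bit in
its group's inode bitmap.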
<A NAME="IDX25"></A>
<A NAME="IDX26"></A>
<P>
The inode bitmap of each group refers to inodes ranging from the first
inode of the group to the last inode of the group. To access the bit of
a precise inode, we first have to look for the group the inode belongs
to and then look for the bit of this inode in the inode bitmap contained
in the group. To obtain the inode information from the inode table, the
process is the same, except that the final search is in the inode table
of the group instead of the inode bitmap.
<P>
<P>Go to the <A HREF="ext2fs_5.html">previous</A>, <A HREF="ext2fs_7.html">next</A> section.<P>

<!-- X-URL: http://step.polymtl.ca/~ldd/ext2fs/ext2fs_7.html -->
<!-- This HTML file has been created by texi2html 1.29
from ext2fs.texi on 3 August 1994 -->
<TITLE>Analysis of the Ext2fs structure - Inodes</TITLE>
<P>Go to the <A HREF="ext2fs_6.html">previous</A>, <A HREF="ext2fs_8.html">next</A> section.<P>
<A NAME="IDX27"></A>
<H1><A NAME="SEC7" HREF="ext2fs_toc.html#SEC7">Inodes</A></H1>
<A NAME="IDX28"></A>
<A NAME="IDX29"></A>
<A NAME="IDX30"></A>
<A NAME="IDX31"></A>
<A NAME="IDX32"></A>
<A NAME="IDX33"></A>
<P>
An inode uniquely describes a file. Here's what an inode looks like on
disk:
<P>
<PRE>
struct ext2_inode {
unsigned short i_mode;
unsigned short i_uid;
unsigned long i_size;
unsigned long i_atime;
unsigned long i_ctime;
unsigned long i_mtime;
unsigned long i_dtime;
unsigned short i_gid;
unsigned short i_links_count;
unsigned long i_blocks;
unsigned long i_flags;
unsigned long i_reserved1;
unsigned long i_block[EXT2_N_BLOCKS];
unsigned long i_version;
unsigned long i_file_acl;
unsigned long i_dir_acl;
unsigned long i_faddr;
unsigned char i_frag;
unsigned char i_fsize;
unsigned short i_pad1;
unsigned long i_reserved2[2];
};
</PRE>
<P>
<DL COMPACT>
<DT><CODE>i_mode</CODE>
<DD>type of file (character, block, link, etc.) and access rights on the
file.
<P>
<DT><CODE>i_uid</CODE>
<DD>uid of the owner of the file.
<P>
<DT><CODE>i_size</CODE>
<DD>logical size in bytes.
<P>
<DT><CODE>i_atime</CODE>
<DD>last time the file was accessed.
<P>
<DT><CODE>i_ctime</CODE>
<DD>last time the inode information of the file was changed.
<P>
<DT><CODE>i_mtime</CODE>
<DD>last time the file content was modified.
<P>
<DT><CODE>i_dtime</CODE>
<DD>when this file was deleted.
<P>
<DT><CODE>i_gid</CODE>
<DD>gid of the file.
<P>
<DT><CODE>i_links_count</CODE>
<DD>number of links pointing to this file.
<P>
<DT><CODE>i_blocks</CODE>
<DD>number of blocks allocated to this file counted in 512 bytes units.
<P>
<DT><CODE>i_flags</CODE>
<DD>flags (see below).
<P>
<DT><CODE>i_reserved1</CODE>
<DD>reserved.
<P>
<DT><CODE>i_block</CODE>
<DD>pointers to blocks (see below).
<P>
<DT><CODE>i_version</CODE>
<DD>version of the file (used by NFS).
<P>
<DT><CODE>i_file_acl</CODE>
<DD>control access list of the file (not used yet).
<P>
<DT><CODE>i_dir_acl</CODE>
<DD>control access list of the directory (not used yet).
<P>
<DT><CODE>i_faddr</CODE>
<DD>block where the fragment of the file resides.
<P>
<DT><CODE>i_frag</CODE>
<DD>number of the fragment in the block.
<P>
<DT><CODE>i_fsize</CODE>
<DD>size of the fragment.
<P>
<DT><CODE>i_pad1</CODE>
<DD>padding.
<P>
<DT><CODE>i_reserved2</CODE>
<DD>reserved.
</DL>
<P>
As you can see, the inode contains <CODE>EXT2_N_BLOCKS</CODE> (15 in ext2fs
0.5) pointers to blocks. Of these pointers, the first
<CODE>EXT2_NDIR_BLOCKS</CODE> (12) are direct pointers to data. The next entry
points to a block of pointers to data (single indirection). The next entry
points to a block of pointers to blocks of pointers to data (double
indirection). The last entry points to a block of pointers to blocks of
pointers to blocks of pointers to data (triple indirection).
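<P>
As an illustration of this pointer scheme, the following sketch
(assuming 4-byte block pointers, as in ext2fs 0.5) computes how many
data blocks a single inode can address:

```c
/* Maximum number of data blocks reachable from one inode:
   12 direct pointers, plus one level of single, double and triple
   indirection, with 4-byte block pointers. */
unsigned long long max_file_blocks(unsigned long block_size)
{
    unsigned long long ptrs = block_size / 4;   /* pointers per block */
    return 12 + ptrs + ptrs * ptrs + ptrs * ptrs * ptrs;
}
```

With 1024-byte blocks this gives 16,843,020 addressable blocks, which
is why the logical file size limit grows so quickly with block size.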
<P>
The inode flags may take one or more of the following or'ed values:
<P>
<DL COMPACT>
<DT><CODE>EXT2_SECRM_FL 0x0001</CODE>
<DD>secure deletion. This usually means that when this flag is set and we
delete the file, random data is written in the blocks previously allocated
to the file.
<P>
<DT><CODE>EXT2_UNRM_FL 0x0002</CODE>
<DD>undelete. When this flag is set and the file is being deleted, the file
system code must store enough information to ensure the undeletion of
the file (to a certain extent).
<P>
<DT><CODE>EXT2_COMPR_FL 0x0004</CODE>
<DD>compressed file. The content of the file is compressed; the file system
code must use compression/decompression algorithms when accessing the
data of this file.
<P>
<DT><CODE>EXT2_SYNC_FL 0x0008</CODE>
<DD>synchronous updates. The disk representation of this file must be kept
in sync with its in-core representation. Asynchronous I/O on this kind
of file is not possible. The synchronous updates apply only to the inode
itself and to the indirect blocks; data blocks are always written
asynchronously to the disk.
</DL>
<A NAME="IDX34"></A>
<A NAME="IDX35"></A>
<A NAME="IDX36"></A>
<A NAME="IDX37"></A>
<A NAME="IDX38"></A>
<A NAME="IDX39"></A>
<A NAME="IDX40"></A>
<A NAME="IDX41"></A>
<A NAME="IDX42"></A>
<P>
Some inodes have a special meaning:
<P>
<DL COMPACT>
<DT><CODE>EXT2_BAD_INO 1</CODE>
<DD>a file containing the list of bad blocks on the file system.
<P>
<DT><CODE>EXT2_ROOT_INO 2</CODE>
<DD>the root directory of the file system.
<P>
<DT><CODE>EXT2_ACL_IDX_INO 3</CODE>
<DD>ACL inode.
<P>
<DT><CODE>EXT2_ACL_DATA_INO 4</CODE>
<DD>ACL inode.
<P>
<DT><CODE>EXT2_BOOT_LOADER_INO 5</CODE>
<DD>the file containing the boot loader. (Not used yet it seems.)
<P>
<DT><CODE>EXT2_UNDEL_DIR_INO 6</CODE>
<DD>the undelete directory of the system.
<P>
<DT><CODE>EXT2_FIRST_INO 11</CODE>
<DD>this is the first inode that does not have a special meaning.
</DL>
<P>
<P>Go to the <A HREF="ext2fs_6.html">previous</A>, <A HREF="ext2fs_8.html">next</A> section.<P>

<!-- X-URL: http://step.polymtl.ca/~ldd/ext2fs/ext2fs_8.html -->
<!-- This HTML file has been created by texi2html 1.29
from ext2fs.texi on 3 August 1994 -->
<TITLE>Analysis of the Ext2fs structure - Directories</TITLE>
<P>Go to the <A HREF="ext2fs_7.html">previous</A>, <A HREF="ext2fs_9.html">next</A> section.<P>
<A NAME="IDX43"></A>
<H1><A NAME="SEC8" HREF="ext2fs_toc.html#SEC8">Directories</A></H1>
<A NAME="IDX44"></A>
<P>
Directories are special files that are used to create access paths to
the files on disk. It is very important to understand that an inode may
have many access paths. Since directories are an essential part of the
file system, they have a specific structure. A directory file is a list
of entries of the following format:
<P>
<PRE>
struct ext2_dir_entry {
unsigned long inode;
unsigned short rec_len;
unsigned short name_len;
char name[EXT2_NAME_LEN];
};
</PRE>
<P>
<DL COMPACT>
<DT><CODE>inode</CODE>
<DD>points to the inode of the file.
<P>
<DT><CODE>rec_len</CODE>
<DD>length of the entry record.
<P>
<DT><CODE>name_len</CODE>
<DD>length of the file name.
<P>
<DT><CODE>name</CODE>
<DD>name of the file. This name may have a maximum length of
<CODE>EXT2_NAME_LEN</CODE> bytes (255 bytes as of version 0.5).
</DL>
<A NAME="IDX45"></A>
<A NAME="IDX46"></A>
<A NAME="IDX47"></A>
<P>
There is one such entry in the directory file for each file in the
directory. Since ext2fs is a Unix file system, the first two entries in
a directory are the files <SAMP>`.'</SAMP> and <SAMP>`..'</SAMP>, which point to the
current directory and the parent directory, respectively.
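<P>
Because <CODE>rec_len</CODE> gives the length of each record, a directory
block is traversed by hopping from entry to entry. A minimal sketch
(the fixed-size entry head and the helper name are this example's own;
real code must also honor the on-disk byte order):

```c
#include <stdint.h>
#include <string.h>

/* Fixed-size head of an ext2 directory entry; name_len bytes of
   name follow it in the block. */
struct dir_entry_head {
    uint32_t inode;
    uint16_t rec_len;
    uint16_t name_len;
};

/* Count the in-use entries (inode != 0) in one directory block,
   advancing by rec_len each time. */
unsigned count_entries(const unsigned char *block, unsigned block_size)
{
    unsigned offset = 0, count = 0;
    while (offset + sizeof(struct dir_entry_head) <= block_size) {
        struct dir_entry_head e;
        memcpy(&e, block + offset, sizeof e);  /* avoid alignment traps */
        if (e.rec_len < sizeof e)              /* corrupt entry: stop */
            break;
        if (e.inode != 0)                      /* inode 0 = unused slot */
            count++;
        offset += e.rec_len;
    }
    return count;
}
```

Deleted entries are typically left in place with <CODE>inode</CODE> set to
zero or merged into the preceding entry's <CODE>rec_len</CODE>, which is why
the walk skips entries whose inode field is zero.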
<P>
<P>Go to the <A HREF="ext2fs_7.html">previous</A>, <A HREF="ext2fs_9.html">next</A> section.<P>

<!-- X-URL: http://step.polymtl.ca/~ldd/ext2fs/ext2fs_9.html -->
<!-- This HTML file has been created by texi2html 1.29
from ext2fs.texi on 3 August 1994 -->
<TITLE>Analysis of the Ext2fs structure - Allocation algorithms</TITLE>
<P>Go to the <A HREF="ext2fs_8.html">previous</A>, <A HREF="ext2fs_10.html">next</A> section.<P>
<H1><A NAME="SEC9" HREF="ext2fs_toc.html#SEC9">Allocation algorithms</A></H1>
<P>
Here are the allocation algorithms that ext2 file system managers
<STRONG>must</STRONG> use. We are adamant on this point. Nowadays, many users
run more than one operating system on the same computer. If more than
one operating system uses the same ext2 partition, they have to use the
same allocation algorithms; otherwise, one file system manager will
undo the work of the other. It is useless to have a manager that uses
highly efficient allocation algorithms if the other one does not bother
with allocation and uses quick and dirty algorithms.
<P>
Here are the rules used to allocate new inodes:
<P>
<UL>
<LI>the inode for a new file is allocated in the same group of the
inode of its parent directory.
<P>
<LI>inodes are allocated equally between groups.
</UL>
<P>
Here are the rules used to allocate new blocks:
<P>
<UL>
<LI>a new block is allocated in the same group as its inode.
<P>
<LI>allocate consecutive sequences of blocks.
</UL>
<P>
Of course, it may sometimes be impossible to abide by those rules. In
this case, the manager may allocate the block or inode anywhere.
<P>Go to the <A HREF="ext2fs_8.html">previous</A>, <A HREF="ext2fs_10.html">next</A> section.<P>

<!-- X-URL: http://step.polymtl.ca/~ldd/ext2fs/ext2fs_toc.html -->
<!-- This HTML file has been created by texi2html 1.29
from ext2fs.texi on 3 August 1994 -->
<TITLE>Analysis of the Ext2fs structure - Table of Contents</TITLE>
<H1>Analysis of the Ext2fs structure</H1>
<ADDRESS>Louis-Dominique Dubeau</ADDRESS>
<P>
<UL>
<LI><A NAME="SEC1" HREF="ext2fs_1.html#SEC1">Introduction</A>
<LI><A NAME="SEC2" HREF="ext2fs_2.html#SEC2">Blocks and Fragments</A>
<LI><A NAME="SEC3" HREF="ext2fs_3.html#SEC3">Groups</A>
<LI><A NAME="SEC4" HREF="ext2fs_4.html#SEC4">Superblock</A>
<LI><A NAME="SEC5" HREF="ext2fs_5.html#SEC5">Group Descriptors</A>
<LI><A NAME="SEC6" HREF="ext2fs_6.html#SEC6">Bitmaps</A>
<LI><A NAME="SEC7" HREF="ext2fs_7.html#SEC7">Inodes</A>
<LI><A NAME="SEC8" HREF="ext2fs_8.html#SEC8">Directories</A>
<LI><A NAME="SEC9" HREF="ext2fs_9.html#SEC9">Allocation algorithms</A>
<LI><A NAME="SEC10" HREF="ext2fs_10.html#SEC10">Error Handling</A>
<LI><A NAME="SEC11" HREF="ext2fs_11.html#SEC11">Formulae</A>
<LI><A NAME="SEC12" HREF="ext2fs_12.html#SEC12">Invariants</A>
<UL>
<LI><A NAME="SEC13" HREF="ext2fs_13.html#SEC13">File Invariants</A>
<LI><A NAME="SEC14" HREF="ext2fs_14.html#SEC14">File System Invariants</A>
</UL>
<LI><A NAME="SEC15" HREF="ext2fs_15.html#SEC15">References</A>
<LI><A NAME="SEC16" HREF="ext2fs_16.html#SEC16">Concept Index</A>
</UL>

By: Inbar Raz
--------------------------------------------------------------------
The FAT is a linked-list table that DOS uses to keep track of the physical
position of data on a disk and for locating free space for storing new files.
The word at offset 1aH in a directory entry is a cluster number of the first
cluster in an allocation chain. If you locate that cell in the FAT, it will
either indicate the end of the chain or the next cell, etc. Observe:
starting cluster number --|
Directory +-------------------+-+-------------------+---+---+-+-+-------+
Entry -- |M Y F I L E T X T|a| |tim|dat|08 | size |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-|-+-+-+-+-+
+-------------------------+
00 01 02 03 04 05 06 07 |8 09 0a 0b 0c 0d 0e 0f
+--++--++--++--++--++--++--++--++-++--++--++--++--++--++--++--+
00 |ID||ff||03-04-05-ff||00||00||09-0a-0b-15||00||00||00||00|
+--++--++--++--++--++--++--++--++--++--++--++|-++--++--++--++--+
+-----------------------+
+--++--++--++--++--++-++--++--++--++--++--++--++--++--++--++--+
10 |00||00||00||00||00||16-17-19||f7||1a-1b-ff||00||00||00||00|
+--++--++--++--++--++--++--++|-++--++-++--++--++--++--++--++--+
+-------+
This diagram illustrates the main concepts of reading the FAT. In it:
• The file MYFILE.TXT is 10 clusters long. The first byte is in cluster 08
and the last is in cluster 1bH. The chain is 8,9,0a,0b,15,16,17,19,1a,1b.
Each entry indicates the next entry in the chain, with a special code in
the last entry.
• Cluster 18H is marked bad and is not part of any allocation chain.
• Clusters 6,7, 0cH-14H, and 1cH-1fH are empty and available for allocation.
• Another chain starts at cluster 2 and ends at cluster 5.
+-----------+
| FAT Facts | The FAT normally starts at logical sector 1 in the DOS partition
+-----------+ (eg, you can read it with INT 25H with DX=1). The only way to
be sure is to read the boot sector (DX=0), and examine offset 0eH. This
tells how many boot and reserved sectors come before the FAT. Use that
number (usually 1) in DX to read the FAT via INT 25H.
There may be more than one copy of the FAT. There are usually two complete
copies. If there are two or more, they will all be adjacent (the second FAT
directly follows the first).
You have the following services available to help you determine information
about the FAT:
• Use INT 25H to read the Boot Sector and examine the data fields therein
• Use DOS Fn 36H or 1cH to determine total disk sectors and clusters
• Use DOS Fn 44H (if the device driver supports Generic IOCTL) DOS 3.2
• Use DOS Fn 32H to get all kinds of useful information. UNDOCUMENTED
Note: The boot sector of non-booting disks (such as network block devices
and very old hard disks) may contain nothing but garbage.
+---------------+
| 12-bit/16-bit | The FAT can be laid out in 12-bit or 16-bit entries. 12-bit
+---------------+ entries are very efficient for media less than 384K--the
entire FAT can fit in a single 512-byte disk sector. For larger media, each
FAT entry must map to a larger and larger cluster size--to the point where a
20M hard disk would need to allocate in units of 16 sectors in order to use
the 12-bit format (in other words, a 1-byte file would take up a full 8K
cluster of a disk).
16-bit FAT entries were introduced with DOS 3.0, driven by the need to
handle the AT's 20-Megabyte hard disk efficiently. However, floppy disks
and 10M hard disks continue to use the 12-bit layout. You can determine if
the FAT is laid out with 12-bit or 16-bit elements:
DOS 3.0 says: If a disk has more than 4086 (0ff6H) clusters, it uses 16 bits
(4096 is max value for a 12-bit number and >0ff6H is reserved)
DOS 3.2 says: If a disk has more than 20740 (5104H) SECTORS, it uses 16 bits
(in other words, any disk over 10 Megabytes uses a 16-bit FAT
and all others--including large RAM disks--use 12-bits).
Note: It's a common misconception that the 16-bit FAT allows DOS to work with
disks larger than 32 Megabytes. In fact, the limiting factor is that
INT 25H/26H (through which DOS performs its disk I/O) is unable to
access a SECTOR number higher than 65535. Normally, sectors are 512
bytes (½K), so that sets the 32M limit.
In DOS 4.0, INT 25H/26H supports a technique for accessing sector
numbers higher than 65535, and thus supports trans-32M DOS partitions.
This has no effect on the layout of the FAT itself. Using 16-bit FAT
entries and 4-sector clusters, DOS now supports partitions up to 134M
(twice that for 8-sector clusters, etc.).
+-----------------+
| Reading the FAT | To read the value of any entry in a FAT (as when following
+-----------------+ a FAT chain), first read the entire FAT into memory and
obtain a starting cluster number from a directory. Then, for 12-bit entries:
• Multiply the cluster number by 3
• Divide the result by 2 (each entry is 1.5 (3/2) bytes long)
• Read the WORD at the resulting address (as offset from the start of the FAT)
• If the cluster number was even, mask the value by 0fffH (keep the low 12
bits); if the cluster number was odd, shift the value right by 4 bits
(keep the upper 12 bits)
• The result is the entry for the next cluster in the chain (0fffH=the end).
Note: A 12-bit entry can cross over a sector boundary, so be careful with
1-sector FAT buffering schemes.
16-bit entries are simpler--each entry is a 16-bit word containing the
number of the next cluster in the chain (0ffffH indicates the end).
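The two lookups can be sketched in C (a minimal sketch, assuming the whole
FAT has already been read into memory as a byte array):

```c
#include <stdint.h>

/* Follow one link in a 12-bit FAT. Each entry is 1.5 bytes: even
   clusters use the low 12 bits of the word at offset cluster*3/2,
   odd clusters use the high 12 bits. */
unsigned next_cluster12(const uint8_t *fat, unsigned cluster)
{
    unsigned offset = cluster * 3 / 2;
    unsigned value  = fat[offset] | (fat[offset + 1] << 8);
    return (cluster & 1) ? (value >> 4) : (value & 0x0FFF);
}

/* 16-bit FATs are a plain array of little-endian words. */
unsigned next_cluster16(const uint8_t *fat, unsigned cluster)
{
    return fat[cluster * 2] | (fat[cluster * 2 + 1] << 8);
}
```

Note that in the 12-bit case the two bytes read may straddle a sector
boundary, which is why reading the whole FAT first is the safest scheme.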
+-------------+
| FAT Content | The first byte of the FAT is called the Media Descriptor or
+-------------+ FAT ID byte. The next 5 bytes (12-bit FATs) or 7 bytes
(16-bit FATs) are 0ffH. The rest of the FAT is composed of 12-bit or 16-bit
cells that each represent one disk cluster. These cells will contain one of
the following values:
• (0)000H ................... an available cluster
• (f)ff0H through (f)ff7H ... a reserved cluster
• (f)ff7H ................... a bad cluster
• (f)ff8H through (f)fffH ... the end of an allocation chain
• (0)002H through (f)fefH ... the number of the next cluster in a chain
Note: the high nibble of the value is used only in 16-bit FATs; eg, a bad
cluster is marked with 0ff7H in 12-bit FATs, and fff7H with 16-bit FATs.
+------------------------------------------------+
| Converting a Cluster Number to a Sector Number | After you obtain a file's
+------------------------------------------------+ starting cluster number
from a directory entry you will want to locate the actual disk sector that
holds the file (or subdirectory) data.
A diskette (or a DOS partition of a hard disk) is laid out like so:
• Boot and reserved sector(s)
• FAT #1
• FAT #2 (optional -- not used on RAM disks)
• root directory
• data area (all file data reside here, including files for directories)
Every section of this layout is variable and the sizes of each section must
be known in order to perform a correct cluster-to-sector conversion. The
following formulae represent the only documented method of determining a DOS
logical sector number from a cluster number:
RootDirSectors = (rootDirEntries * 32) / sectorBytes
FatSectors = fatCount * sectorsPerFat
DataStart = reservedSectors + FatSectors + RootDirSectors
INT 25h/26h Sector = DataStart + ((AnyClusterNumber-2) * sectorsPerCluster)
Where the variables:
sectorBytes sectorsPerFat fatCount
rootDirEntries reservedSectors sectorsPerCluster
are obtained from the Boot Sector or from a BPB (if you can get access). The
resulting sector number can be used in DX for INT 25H/26H DOS absolute disk
access.
If you are a daring sort of person, you can save trouble by using the
undocumented DOS Fn 32H (Get Disk Info) which provides a package of pre-
calculated data, including the sector number of the start of file data (it
gives you "DataStart", in the above equation).
Author's note: The best use I've found for all this information is in
directory scanning; ie, to bypass the DOS file-searching services and read
directory sectors directly. For a program that must obtain a list of all
files and directories, direct access of directory sectors will work roughly
twice as fast as DOS Fns 4eH and 4fH.

<HTML>
<HEAD>
<TITLE>ISO9660 Simplified for DOS/Windows</TITLE>
<META name="description"
content="A simplified description of the ISO9660 file specification as
used on CD-ROM disks with DOS and Windows.">
</HEAD>
<BODY>
<H2>ISO9660 Simplified for DOS/Windows<BR>
by Philip J. Erdelsky</H2>
<H4>1. Introduction</H4>
<P>We weren't sure about it a few years ago, but by now it should be
clear to everyone that CD-ROM's are here to stay. Most PC's are equipped
with CD-ROM readers, and most major PC software packages are being
distributed on CD-ROM's.
<P>Under DOS (and Windows, which uses the DOS file system) files are
written to both hard and floppy disks with a so-called FAT (File
Allocation Table) file system.
<P>Files on a CD-ROM, however, are written to a different standard,
called ISO9660. ISO9660 is rather complex and poorly written, and
obviously contains a number of diplomatic compromises among advocates of
DOS, UNIX, MVS and perhaps other operating systems.
<P>The simplified version presented here includes only features that
would normally be found on a CD-ROM to be used in a DOS system and which
are supported by the Microsoft MS-DOS CD-ROM Extensions (MSCDEX). It is
based on ISO9660, on certain documents regarding MSCDEX (version 2.10),
and on the contents of some actual CD-ROM's.
<P>Where a field has a specific value on a CD-ROM to be used with DOS,
that value is given in this document. However, in some cases a brief
description of values for use with other operating systems is given in
square brackets.
<P>ISO9660 makes provisions for sets of CD-ROM's, and apparently even
permits a file system to span more than one CD-ROM. However, this
feature is not supported by MSCDEX.
<H4>2. Files</H4>
<P>The directory structure on a CD-ROM is almost exactly like that on a
DOS floppy or hard disk. (It is presumed that the reader of this
document is reasonably familiar with the DOS file system.) For this
reason, DOS and Windows applications can read files from a CD-ROM just
as they would from a floppy or hard disk.
<P>There are only a few differences, which do not affect most
applications:
<OL>
<LI>The root directory contains the notorious "." and ".." entries,
just like any other directory.
<LI>There is no limit, other than disk capacity, to the size of the
root directory.
<LI>The depth of directory nesting is limited to eight levels,
including the root. For example, if drive E: contains a CD-ROM,
a file such as E:\D2\D3\D4\D5\D6\D7\D8\FOO.TXT is permitted but
E:\D2\D3\D4\D5\D6\D7\D8\D9\FOO.TXT is not.
<LI>If a CD-ROM is to be used by a DOS system, file names and
extensions must be limited to eight and three characters,
respectively, even though ISO9660 permits longer names and
extensions.
<LI>ISO9660 permits only capital letters, digits and underscores in
a file or directory name or extension, but DOS also permits a
number of other punctuation marks.
<LI>ISO9660 permits a file to have an extension but not a name, but
DOS does not.
<LI>DOS permits a directory to have an extension, but ISO9660 does
not.
<LI>Directories on a CD-ROM are always sorted, as described below.
</OL>
<P>Of course, neither DOS, nor UNIX, nor any other operating system can
WRITE files to a CD-ROM as it would to a floppy or hard disk, because a
CD-ROM is not rewritable. Files must be written to the CD-ROM by a
special program with special hardware.
<H4>3. Sectors</H4>
<P>The information on a CD-ROM is divided into sectors, which are
numbered consecutively, starting with zero. There are no gaps in the
numbering.
<P>Each sector contains 2048 8-bit bytes. (ISO9660 apparently permits
other sector sizes, but the 2048-byte size seems to be universal.)
<P>When a number of sectors are to be read from the CD-ROM, they should
be read in order of increasing sector number, if possible, since that is
the order in which they pass under the read head as the CD-ROM rotates.
Most implementations arrange the information so sectors will be read in
this order for typical file operations, although ISO9660 does not
require this in all cases.
<P>The order of bytes within a sector is considered to be the order in
which they appear when read into memory; i.e., the "first" bytes are
read into the lowest memory addresses. This is also the order used in
this document; i.e., the "first" bytes in any list appear at the top of
the list.
<H4>4. Character Sets</H4>
<P>Names and extensions of files and directories, the volume name, and
some other names are expressed in standard ASCII character codes
(although ISO9660 does not use the name ASCII). According to ISO9660,
only capital letters, digits, and underscores are permitted. However,
DOS permits some other punctuation marks, which are sometimes found on
CD-ROM's, in apparent defiance of ISO9660.
<P>MSCDEX does offer support for the kanji (Japanese) character set.
However, this document does not cover kanji.
<H4>5. Sorting Names or Extensions</H4>
<P>Where ISO9660 requires file or directory names or extensions to be
sorted, the usual ASCII collating sequence is used. That is, two
different names or extensions are compared as follows:
<OL>
<LI>ASCII blanks (32) are added to the right end of the shorter
name or extension, if necessary, to make it as long as the
longer name or extension.
<LI>The first (leftmost) position in which the names or extensions
are not identical determines the order. The name or extension
with the lower ASCII code in that position appears first in the
sorted order.
</OL>
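<P>A sketch of this comparison rule (a hypothetical helper, not part of
ISO9660 itself):

```c
/* Compare two ISO9660 names per the sorting rules above: pad the
   shorter name with ASCII blanks (32), then compare position by
   position. Returns <0, 0 or >0 like strcmp. */
int iso_name_cmp(const char *a, unsigned alen, const char *b, unsigned blen)
{
    unsigned len = alen > blen ? alen : blen;
    for (unsigned i = 0; i < len; i++) {
        int ca = i < alen ? (unsigned char)a[i] : ' ';
        int cb = i < blen ? (unsigned char)b[i] : ' ';
        if (ca != cb)
            return ca - cb;   /* lower ASCII code sorts first */
    }
    return 0;
}
```

Because the pad character is a blank (32), "FOO" sorts before "FOOBAR",
exactly as the two-step rule above requires.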
<H4>6. Multiple-Byte Values</H4>
<P>A 16-bit numeric value (usually called a word) may be represented on
a CD-ROM in any of three ways:
<DL>
<DT>Little Endian Word: <DD>The value occupies two consecutive bytes, with
the less significant byte first.
<DT>Big Endian Word: <DD>The value occupies two consecutive bytes, with
the more significant byte first.
<DT>Both Endian Word: <DD>The value occupies FOUR consecutive bytes; the
first and second bytes contain the value expressed as a little
endian word, and the third and fourth bytes contain the same
value expressed as a big endian word.
</DL>
<P>A 32-bit numeric value (usually called a double word) may be
represented on a CD-ROM in any of three ways:
<DL>
<DT>Little Endian Double Word: <DD>The value occupies four consecutive
bytes, with the least significant byte first and the other bytes
in order of increasing significance.
<DT>Big Endian Double Word: <DD>The value occupies four consecutive bytes,
with the most significant first and the other bytes in order of
decreasing significance.
<DT>Both Endian Double Word: <DD>The value occupies EIGHT consecutive
bytes; the first four bytes contain the value expressed as a
little endian double word, and the last four bytes contain the
same value expressed as a big endian double word.
</DL>
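<P>A both endian value can be read portably by decoding just the little
endian half byte by byte, as in this sketch:

```c
#include <stdint.h>

/* Read a both endian double word (8 bytes on disk: little endian
   copy first, big endian copy second) by decoding the little endian
   half. A stricter reader could also check the two halves agree. */
uint32_t read_both_endian_dword(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

Decoding byte by byte, rather than casting the pointer, keeps the code
correct on machines of either native byte order.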
<H4>7. The First Sixteen Sectors are Empty</H4>
<P>The first sixteen sectors (sector numbers 0 to 15, inclusive) contain
nothing but zeros. ISO9660 does not define the contents of these
sectors, but for DOS they are apparently always written as zeros. They
are apparently reserved for use by systems that can be booted from a
CD-ROM.
<H4>8. The Volume Descriptors</H4>
<P>Sector 16 and a few of the following sectors contain a series of
volume descriptors. There are several kinds of volume descriptor, but
only two are normally used with DOS. Each volume descriptor occupies
exactly one sector.
<P>The last volume descriptors in the series are one or more Volume
Descriptor Set Terminators. The first seven bytes of a Volume Descriptor
Set Terminator are 255, 67, 68, 48, 48, 49 and 1, respectively. The
other 2041 bytes are zeros. (The middle bytes are the ASCII codes for
the characters CD001.)
<P>The only volume descriptor of real interest under DOS is the Primary
Volume Descriptor. There must be at least one, and there is usually only
one. However, some CD-ROM's have two or more identical Primary Volume
Descriptors. The contents of a Primary Volume Descriptor are as follows:
<pre>
length
in bytes contents
-------- ---------------------------------------------------------
1 1
6 67, 68, 48, 48, 49 and 1, respectively (same as Volume
Descriptor Set Terminator)
1 0
32 system identifier
32 volume identifier
8 zeros
8 total number of sectors, as a both endian double word
32 zeros
4 1, as a both endian word [volume set size]
4 1, as a both endian word [volume sequence number]
4 2048 (the sector size), as a both endian word
8 path table length in bytes, as a both endian double word
4 number of first sector in first little endian path table,
as a little endian double word
4 number of first sector in second little endian path table,
as a little endian double word, or zero if there is no
second little endian path table
4 number of first sector in first big endian path table,
as a big endian double word
4 number of first sector in second big endian path table,
as a big endian double word, or zero if there is no
second big endian path table
34 root directory record, as described below
128 volume set identifier
128 publisher identifier
128 data preparer identifier
128 application identifier
37 copyright file identifier
37 abstract file identifier
37 bibliographical file identifier
17 date and time of volume creation
17 date and time of most recent modification
17 date and time when volume expires
17 date and time when volume is effective
1 1
1 0
512 reserved for application use (usually zeros)
653 zeros
</pre>
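<P>A minimal Python sketch (the function name and returned field names are
the author's own, not from ISO9660) of extracting the fields DOS actually
uses from a 2048-byte Primary Volume Descriptor sector. The byte offsets
(40, 80, 128 and 156) follow from summing the field lengths in the table
above; only the little endian half of each both endian field is read:
<pre>
def parse_pvd(sector: bytes) -> dict:
    """Extract the DOS-relevant fields from a Primary Volume
    Descriptor sector, using the offsets implied by the table."""
    if sector[0] != 1 or sector[1:6] != b"CD001":
        raise ValueError("not a Primary Volume Descriptor")
    return {
        "volume_id": sector[40:72].decode("ascii").rstrip(),
        "total_sectors": int.from_bytes(sector[80:84], "little"),
        "sector_size": int.from_bytes(sector[128:130], "little"),
        "root_dir_record": sector[156:190],  # 34-byte directory record
    }
</pre>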
<P>The first 11 characters of the volume identifier are returned as the
volume identifier by standard DOS system calls and utilities.
<P>Other identifiers are not used by DOS, and may be filled with ASCII
blanks (32).
<P>Each date and time field is of the following form:
<pre>
length
in bytes contents
-------- ---------------------------------------------------------
4 year, as four ASCII digits
2 month, as two ASCII digits, where
01=January, 02=February, etc.
2 day of month, as two ASCII digits, in the range
from 01 to 31
2 hour, as two ASCII digits, in the range from 00 to 23
2 minute, as two ASCII digits, in the range from 00 to 59
2 second, as two ASCII digits, in the range from 00 to 59
2 hundredths of a second, as two ASCII digits, in the range
from 00 to 99
1 offset from Greenwich Mean Time, in 15-minute intervals,
as a twos complement signed number, positive for time
zones east of Greenwich, and negative for time zones
west of Greenwich
</pre>
<P>If the date and time are not specified, the first 16 bytes are all
ASCII zeros (48), and the last byte is zero.
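<P>Decoding this 17-byte field is straightforward; the following Python
sketch (an illustration, not part of the original text) returns None for
the unspecified-date case described above:
<pre>
from datetime import datetime, timedelta, timezone

def parse_volume_datetime(field: bytes):
    """Decode a 17-byte volume descriptor date/time field.
    Returns None for an unspecified date (16 ASCII zeros)."""
    text = field[:16].decode("ascii")
    if text == "0" * 16:
        return None
    # Signed offset in 15-minute intervals from Greenwich Mean Time.
    offset = int.from_bytes(field[16:17], "big", signed=True)
    tz = timezone(timedelta(minutes=15 * offset))
    return datetime(int(text[0:4]), int(text[4:6]), int(text[6:8]),
                    int(text[8:10]), int(text[10:12]), int(text[12:14]),
                    int(text[14:16]) * 10000, tz)  # hundredths -> microseconds
</pre>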
<P>Other kinds of Volume Descriptors (which are normally ignored by DOS)
have the following format:
<pre>
length
in bytes contents
-------- ---------------------------------------------------------
1 neither 1 nor 255
6 67, 68, 48, 48, 49 and 1, respectively (same as Volume
Descriptor Set Terminator)
2041 other things
</pre>
<H4>9. Path Tables</H4>
<P>The path tables normally come right after the volume descriptors.
However, ISO9660 merely requires that each path table begin in the
sector specified by the Primary Volume Descriptor.
<P>The path tables are actually redundant, since all of the information
contained in them is also stored elsewhere on the CD-ROM. However, their
use can make directory searches much faster.
<P>There are two kinds of path table -- a little endian path table, in
which multiple-byte values are stored in little endian order, and a big
endian path table, in which multiple-byte values are stored in big
endian order. The two kinds of path tables are identical in every other
way.
<P>A path table contains one record for each directory on the CD-ROM
(including the root directory). The format of a record is as follows:
<pre>
length
in bytes contents
-------- ---------------------------------------------------------
1 N, the name length (or 1 for the root directory)
1 0 [number of sectors in extended attribute record]
4 number of the first sector in the directory, as a
double word
2 number of record for parent directory (or 1 for the root
directory), as a word; the first record is number 1,
the second record is number 2, etc.
N name (or 0 for the root directory)
0 or 1 padding byte: if N is odd, this field contains a zero; if
N is even, this field is omitted
</pre>
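<P>The record layout above can be walked with a small Python sketch (the
function is the author's own illustration); the caller passes "little" or
"big" to match the kind of path table being read:
<pre>
def parse_path_record(data: bytes, offset: int, byteorder: str):
    """Parse one path table record at the given offset.
    Returns (record, offset_of_next_record)."""
    n = data[offset]  # name length
    first_sector = int.from_bytes(data[offset+2:offset+6], byteorder)
    parent = int.from_bytes(data[offset+6:offset+8], byteorder)
    name = data[offset+8:offset+8+n].decode("ascii")
    # A padding byte follows the name only when N is odd.
    next_off = offset + 8 + n + (n & 1)
    return {"name": name, "first_sector": first_sector,
            "parent": parent}, next_off
</pre>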
<P>According to ISO9660, a directory name consists of at least one and
not more than 31 capital letters, digits and underscores. For DOS the
upper limit is eight characters.
<P>A path table occupies as many consecutive sectors as may be required
to hold all its records. The first record always begins in the first
byte of the first sector. Except for the single byte described above, no
padding is used between records; hence the last record in a sector is
usually continued in the next following sector. The unused part of the
last sector is filled with zeros.
<P>The records in a path table are arranged in a precisely specified
order. For this purpose, each directory has an associated number called
its level. The level of the root directory is 1. The level of each other
directory is one greater than the level of its parent. As noted above,
ISO9660 does not permit levels greater than 8.
<P>The relative positions of any two records are determined as follows:
<OL>
<LI>If the levels are different, the directory with the lower level
appears first. In particular, this implies that the root
directory is always represented by the first record in the
table, because it is the only directory with level 1.
<LI>If the levels are identical, but the directories have different
parents, then the directories are in the same relative
positions as their parents.
<LI>Directories with the same level and the same parent are
arranged in the order obtained by sorting on their names, as
described in Section 5.
</OL>
<H4>10. Directories</H4>
<P>A directory consists of a series of directory records in one or more
consecutive sectors. However, unlike path records, directory records may
not straddle sector boundaries. There may be unused space at the end of
each sector, which is filled with zeros.
<P>Each directory record represents a file or directory. Its format is
as follows:
<pre>
length
in bytes contents
-------- ---------------------------------------------------------
1 R, the number of bytes in the record (which must be even)
1 0 [number of sectors in extended attribute record]
8 number of the first sector of file data or directory
(zero for an empty file), as a both endian double word
8 number of bytes of file data or length of directory,
excluding the extended attribute record,
as a both endian double word
1 number of years since 1900
1 month, where 1=January, 2=February, etc.
1 day of month, in the range from 1 to 31
1 hour, in the range from 0 to 23
1 minute, in the range from 0 to 59
1 second, in the range from 0 to 59
(for DOS this is always an even number)
1 offset from Greenwich Mean Time, in 15-minute intervals,
as a twos complement signed number, positive for time
zones east of Greenwich, and negative for time zones
west of Greenwich (DOS ignores this field)
1 flags, with bits as follows:
bit value
------ ------------------------------------------
        0 (LS)  0 for a normal file, 1 for a hidden file
1 0 for a file, 1 for a directory
2 0 [1 for an associated file]
3 0 [1 for record format specified]
4 0 [1 for permissions specified]
5 0
6 0
7 (MS) 0 [1 if not the final record for the file]
1 0 [file unit size for an interleaved file]
1 0 [interleave gap size for an interleaved file]
4 1, as a both endian word [volume sequence number]
1 N, the identifier length
N identifier
P padding byte: if N is even, P = 1 and this field contains
a zero; if N is odd, P = 0 and this field is omitted
R-33-N-P unspecified field for system use; must contain an even
number of bytes
</pre>
<P>The length of a directory includes the unused space, if any, at the
ends of sectors. Hence it is always an exact multiple of 2048 (the
sector size). Since every directory, even a nominally empty one,
contains at least two records, the length of a directory is never zero.
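<P>A Python sketch of reading one directory record (an illustration, not
part of the original text; the offsets 25, 32 and 33 follow from summing
the field lengths in the table above). It returns None on a zero length
byte, which marks the zero-filled unused space at the end of a sector:
<pre>
def parse_dir_record(data: bytes, offset: int):
    """Parse one directory record; None means end of used space."""
    r = data[offset]  # record length in bytes
    if r == 0:
        return None
    flags = data[offset + 25]
    n = data[offset + 32]  # identifier length
    return {
        "first_sector": int.from_bytes(data[offset+2:offset+6], "little"),
        "length": int.from_bytes(data[offset+10:offset+14], "little"),
        "hidden": bool(flags & 0x01),
        "is_directory": bool(flags & 0x02),
        "identifier": data[offset+33:offset+33+n],
        "next_offset": offset + r,
    }
</pre>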
<P>All fields in the first record (sometimes called the "." record)
refer to the directory itself, except that the identifier length is 1,
and the identifier is zero. The root directory record in the Primary
Volume Descriptor also has this format.
<P>All fields in the second record (sometimes called the ".." record)
refer to the parent directory, except that the identifier length is 1,
and the identifier is 1. The second record in the root directory refers
to the root directory.
<P>The identifier for a subdirectory is its name. The identifier for a
file consists of the following fields, in the order given:
<OL>
<LI>The name, consisting of the ASCII codes for at least one and
not more than eight capital letters, digits and underscores.
<LI>If there is an extension, the ASCII code for a period (46). If
there is no extension, this field is omitted.
<LI>The extension, consisting of the ASCII codes for not more than
three capital letters, digits and underscores. If there is no
extension, this field is omitted.
<LI>The ASCII code for a semicolon (59).
<LI>The ASCII code for 1 (49). [On other systems, this is the
version number, consisting of the ASCII codes for a sequence of
digits representing a number between 1 and 32767, inclusive.]
</OL>
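<P>Splitting such an identifier back into a DOS-style name and extension
can be sketched in a few lines of Python (an illustration, not part of
the original text); the ";1" version suffix is simply discarded, as DOS
does:
<pre>
def split_identifier(identifier: bytes):
    """Split a file identifier like b'README.TXT;1' into
    (name, extension), discarding the version suffix."""
    text = identifier.decode("ascii").split(";")[0]
    if "." in text:
        name, ext = text.split(".", 1)
    else:
        name, ext = text, ""  # no extension: fields (2) and (3) omitted
    return name, ext

print(split_identifier(b"README.TXT;1"))  # ('README', 'TXT')
</pre>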
<P>Some implementations for DOS omit (4) and (5), and some use
punctuation marks other than underscores in file names and extensions.
<P>Directory records other than the first two are sorted as follows:
<OL>
<LI>Records are sorted by name, as described above.
<LI>Every series of records with the same name is sorted by
extension, as described above. For this purpose, a record
without an extension is sorted as though its extension
consisted of ASCII blanks (32).
<LI>[On other systems, every series of records with the same name
and extension is sorted in order of decreasing version number.]
<LI>[On other systems, two records with the same name, extension
and version number are permitted, if the first record is an
associated file.]
</OL>
<P>[ISO9660 permits names containing more than eight characters and
extensions containing more than three characters, as long as both of
them together contain no more than 30 characters.]
<P>It is apparently permissible under ISO9660 to use two or more
consecutive records to represent consecutive pieces of the same file.
Bit 7 of the flags byte is set in every record except the last one.
However, this technique seems pointless and is apparently not used. It
is not supported by MSCDEX.
<P>Interleaving is another technique that is apparently seldom used. It
is not supported by MSCDEX (version 2.10).
<H4>11. Arrangement of Directory and Data Sectors</H4>
<P>ISO9660 does not specify the order of directory or file sectors. It
merely requires that the first sector of each directory or file be in
the location specified by its directory record, and that the sectors for
directories and non-interleaved files be consecutive.
<P>However, most implementations arrange the directories so each
directory follows its parent, and the data sectors for the files in each
directory lie immediately after the directory and immediately before the
next following directory. This appears to be an efficient arrangement
for most applications.
<P>Some implementations go one step further and order the directories in
the same manner as the corresponding path table records.
<H4>12. Extended Attribute Records</H4>
<P>Extended attribute records contain file and directory information
used by operating systems other than DOS, such as permissions and
logical record lengths.
<P>A CD-ROM written for DOS normally does not contain any extended
attribute records.
<P>When reading a CD-ROM containing extended attribute records, early
versions of MSCDEX simply returned incorrect results. Later versions
learned to skip over extended attribute records.
<P>Philip J. Erdelsky<BR>
San Diego, California USA<BR>
<A HREF="mailto:pje@acm.org">pje@acm.org</A><BR>
<A HREF="http://www.alumni.caltech.edu/~pje/">
http://www.alumni.caltech.edu/~pje/</A><BR>
</BODY>
</HTML>