<H1>Inside the High Performance File System</H1>

<H2>Part 1: Introduction</H2>

Written by Dan Bridges

<H2>Introduction</H2>

<P>
This article originally appeared in the February 1996 issue of
Significant Bits, the monthly magazine of the Brisbug PC User Group Inc.

<P>
It is sad to think that most OS/2 users are not using HPFS. The main
reason is that, unless you own the commercial program Partition Magic,
switching to HPFS involves a destructive reformat, and most users
can't be bothered (at least initially). Another reason is user
ignorance of the numerous technical advantages of using HPFS.

<P>
This month we start a series that delves into the structures that make
up OS/2's HPFS. It is very difficult to get any public information on
it aside from what appeared in an article written by Ray Duncan in the
September '89 issue of Microsoft Systems Journal, Vol 4 No 5. I suspect
that the IBM-Microsoft marriage break-up that occurred in 1991 may have
caused an embargo on further HPFS information. I've been searching
books and the Internet for more than a year looking for information with
very little success. You usually end up finding a superficial
description without any detailed discussion of the internal layout of
its structures.

<P>
There are three commercial utilities that I've found very useful. SEDIT
from the GammaTech Utilities v3 is a wonder. It decodes quite a bit of
the information in HPFS' structures. HPFSINFO and HPFSVIEW from the
Graham Utilities are also good. HPFSINFO lists information gleaned from
HPFS' SuperBlock and SpareBlock sectors, while HPFSVIEW provides the
best visual display I've seen of the layout of a HPFS partition. You
can display some information about a sector by clicking on it. HPFSVIEW is
also freely available in the demo version of the Graham Utilities,
GULITE.xxx. I've also written a REXX program to assist with
cross-referencing locations between SEDIT & HPFSVIEW and to provide a
convenient means of dumping a sector.

<P>
Probably the most useful program around at the moment is freeware,
FST03F.xxx (File System Tool) written by Eberhard Mattes. This provides
lots of information and comes with source. Even if you aren't a C
programmer (I'm not) you can learn much from its definition of
structures. Unfortunately I wrote the first three instalments without
seeing this information, which made the task more difficult.

<P>
In the early stages I had to employ a very laborious process in an
attempt to learn more. I created the smallest OS/2 HPFS partition
possible (1 MB). Then I created/altered a file or directory and
compared the changes. Sometimes I knew where the changes would occur so
I could just compare the two sectors, but often I ended up comparing two
1 MB image files looking for differences and then translated the location
in the image into C/H/S (a physical address in Cylinder/Head/Sector
format) or LSN (Logical Sector Number). While more information will be
presented in this series than I've seen in the public domain, there are
still things that I've been unable to decipher.

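<P>
For readers who want to try the same comparison technique, the arithmetic
involved is straightforward. Below is a minimal REXX sketch (not one of the
programs mentioned above) that turns a byte offset within a partition image
into an LSN and then into C/H/S coordinates. The offset and the geometry
(heads and sectors per track) are made-up example values; substitute the
figures for your own drive.

<PRE>
/* Convert a byte offset in a partition image to LSN and C/H/S.      */
/* The offset and geometry below are example values only.            */
offset = 1048576                /* e.g. 1 MB into the image file     */
heads  = 64                     /* assumed drive geometry            */
spt    = 32                     /* assumed sectors per track         */

lsn  = offset % 512                        /* 512-byte sectors       */
cyl  = lsn % (heads * spt)
head = (lsn // (heads * spt)) % spt
sec  = (lsn // spt) + 1                    /* sectors count from 1   */

SAY 'LSN =' lsn '  C/H/S =' cyl'/'head'/'sec
</PRE>
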
<P>
<H2>The Win95 Fizzer</H2>

<P>
For me, the most disappointing feature of Win 95 is the preservation of
the FAT (File Allocation Table) system. It's now known as VFAT but
aside from integrated 32-bit file and disk access, the structure on the
disk is basically the same as DOS v4 (circa 1988). An ungainly method
involving the volume label file attribute was used to graft long
filename support onto the file system. These engineering compromises
were made to most easily achieve backward compatibility. It's a pity
because Microsoft has an excellent file system available in NT, namely
NTFS. This file system is very robust although perhaps NTFS is overkill
for the small user.

<P>
The Presentation Manager graphical user interface (GUI) appeared in OS/2 v1.1
in 1988. The sophisticated High Performance File System came with OS/2
v1.2, which was released way back in 1989! The powerful REXX scripting
language showed up in OS/2 v1.3 (1991). And the largely
object-orientated WPS (Work Place Shell) GUI appeared in 1992 in OS/2
v2.0. So it is hardly surprising that experienced OS/2 users were not
swept up in the general hysteria about Windows 95 being the latest and
greatest.

<P>
A positive aspect of the Win 95 craze has been that its minimum system
requirement of 8 MB RAM and a 486/33 makes a good platform for OS/2 Warp. So
now the disgruntled Win 95 user will find switching OSs less daunting,
at least from a hardware viewpoint.

<P>
<H2>Dual Boot and Boot Manager</H2>

<P>
I've never used Dual Boot because it seems so limiting. I've always
reformatted and installed Boot Manager so that I could select from up to
four different operating systems, for example OS/2 v2.1, OS/2 Warp
Connect (peer-to-peer networking with TCP/IP and Internet support), IBM
DOS v7 and Linux.

<P>
In previous OS/2 installations, I've left a small (50 MB) FAT partition
that could be seen when I booted under either DOS or OS/2, while the
rest of the HD space (910 MB) was formatted as HPFS. Recently I
upgraded to Warp Connect and this time I dropped FAT and the separate
DOS boot partition completely. This does not mean I am unable to run
DOS programs. OS/2 has inbuilt IBM DOS v5 and you can install boot
images of other versions of DOS, or even CP/M, for near instantaneous
booting of these versions. There is no reason why you can't have
multiple flavours of DOS running at the same time as you're running
multiple OS/2 sessions. Furthermore DOS programs have no problems
reading from, writing to or running programs on HPFS partitions even
though the layout is nothing like FAT. It's all handled transparently
by OS/2. But this does mean you have to have booted OS/2 first. HPFS
is not visible if you use either Dual Boot or Boot Manager to boot
directly to DOS, but there are a number of shareware programs around to
allow read-access to HPFS drives from DOS.

<P>
DOS uses the system BIOS to access the hard disk. This is limited to
dealing with a HD that has no more than 1,024 cylinders due to 10 bits
(2^10 = 1,024) being used in the BIOS for cylinder numbering. OS/2 uses
the system BIOS at boot time but then completely replaces it in memory
with a special Advanced BIOS. This means that the boot partition and,
if you use it, Boot Manager's 1 MB partition, must be within the first
1,024 cylinders. Once you've booted OS/2, however, you can access
partitions on cylinders past the Cyl 1023 point (counting from zero)
without having to worry about LBA (Logical Block Addressing) translation
schemes.

<P>
Now this can still catch you out if you boot DOS. On my old system I'd
sometimes use Boot Manager to boot a native DOS version. I'd load AMOS
(a shareware program) to see the HPFS drives. I thought there must have
been a bug in AMOS because I could only see half of F: and none of G:
until I realised that these partitions were situated on a third HD that
had 1,335 cylinders. So this was just the effect of DOS' 1,024 cylinder
limitation, which the AMOS program was unable to circumvent.

<P>
<H2>Differences between an Easy and an Advanced Installation</H2>

<P>
Most new OS/2 users select the "Easy Installation" option. This is
satisfactory, but it only utilises FAT, installs OS/2 on the same drive
as DOS and Windows, does not reformat the partition and installs Dual
Boot.

<P>
If you know what you're doing, or are more aggressive in wanting to take
advantage of what OS/2 can provide, then the "Advanced Installation"
option is for you. Selecting it enables you to selectively install
parts of OS/2; install OS/2 in a primary or logical (extended) partition
other than C:, or even on a 2nd HD (I don't know whether you can install
on higher physical drives than the 2nd one in a SCSI multi-drive setup);
install Boot Manager; use HPFS if you wish; and install onto a blank HD.

<P>
<H2>FAT vs HPFS: If Something Goes Wrong</H2>

<P>
CHKDSK on a HPFS partition can recover from much more severe faults than
it can on a FAT system. This is because the cluster linkages in a FAT
system are one-way, pointing to the next cluster in the chain. If the
link is broken it is usually impossible to work out where the lost
clusters ("x lost clusters in y chains") should be reattached. Often
they are just artifacts of a program's use of temporary files that
haven't been cleaned up properly. But "file truncated" and
"cross-linked files" messages are usually an indication of more serious
FAT problems.

<P>
HPFS uses double linking: the allocation block of a directory or file
points back to its predecessor ("parent") as well as to the next element
("child"). Moreover, major structures contain dword (32-bit) signatures
identifying their role and each file/directory's FNODE contains the
first 15 characters of its name. So blind scanning can be performed by
CHKDSK or other utilities to rebuild much of the system after a
significant problem.

<P>
As a personal comment, I've been using HPFS since April 1993, and I've
yet to experience any serious file system problems. I've had many OS/2
lockups while downloading with a DOS comms program and until recently
I was running a 4 MB hardware disk cache with delayed writes, yet,
aside from the lost download file, the file system has not been
permanently corrupted.

<P>
<H2>Warp, FORMAT /FS:HPFS, CHKDSK /F:3 and The Lazarus Effect</H2>

<P>
Warp, by default, does a quick format when you format a HD under either
FAT or HPFS. So FORMAT /FS:HPFS x:, which is what the installation
program performs if you decide to format the disk with HPFS, is
performed very quickly. It's almost instantaneous if you decide to
reformat with FAT (/FS:FAT). Now this speed differential does not mean
that FAT is much quicker, only that FORMAT has very little work to
perform during a quick FAT reformat since the FAT structures are so
simple compared to HPFS.

<P>
As mentioned earlier, CHKDSK has extended recovery abilities when
dealing with HPFS. It has four levels of /F:n checking/recovery. These
will be considered in greater detail in a later article in this series
when we look at fault tolerance. The default of CHKDSK /F is equivalent
to using /F:2. If you decide to use /F:3 then CHKDSK will dig deep and
recover information that existed on the partition prior to the
reformatting, provided that it was previously formatted as HPFS. Using
CHKDSK /F:3 after performing a quick format on a partition that was
previously FAT but is now HPFS will not cause this, since none of the
previous data has HPFS signature words embedded at the beginning of its
sectors. However, if you ever use /F:3 after quickly reformatting a
HPFS partition you could end up with a bit of a mess, since everything
that existed on the old partition and hadn't been overwritten by the
current contents would be recovered.

<P>
To guard against this, OS/2 stores whether or not a quick format has
been performed on a HPFS partition in bit 5 (counting from zero) of byte
08h in LSN (Logical Sector Number) 17, the SpareBlock sector. This
particular byte is known as the Partition Status byte, with 20h
indicating that a quick format was performed. Bit 0 of this byte is
also used to indicate whether the partition is "clean" or "dirty", so 21h
indicates that the partition was quick formatted and is currently
"dirty" (these concepts will be covered in a later instalment).

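<P>
To make the bit positions concrete, here is a small REXX sketch that decodes
a Partition Status byte value. It works on a value you supply rather than
reading LSN 17 from the disk (raw sector access needs one of the tools
mentioned earlier), and the 21h used below is simply the example value from
the text above.

<PRE>
/* Decode the HPFS Partition Status byte (byte 08h of the SpareBlock, */
/* LSN 17).  The value below is only an example.                      */
status = '21'x                        /* e.g. quick-formatted + dirty */

bits  = X2B(C2X(status))              /* eight '0'/'1' characters     */
quick = SUBSTR(bits, 3, 1)            /* bit 5: quick format done     */
dirty = SUBSTR(bits, 8, 1)            /* bit 0: partition is "dirty"  */

SAY 'Quick-formatted:' quick '   Dirty:' dirty
</PRE>
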
<P>
If you attempt to perform a CHKDSK /F:3 on a quick-formatted partition,
you will receive the following warning:

<PRE>
SYS0641: Using CHKDSK /F:3 on this drive may cause files that existed
before the last FORMAT to be recovered. Proceed with CHKDSK (Y/N)?
</PRE>

<P>
If you type "HELP 641" for further information you'll see:

<PRE>
EXPLANATION: The target drive was formatted in "fast format" mode,
which does not erase all data areas. CHKDSK /F:3 searches data areas
for "lost" files. If a file existed on this drive before the last
format, CHKDSK may find it, and attempt to recover it.

ACTION: Use CHKDSK /F:2 to check this drive. If you use /F:3, be aware
that files recovered to the FOUND directories may be old files. Also,
if you format a drive using FORMAT /L, FORMAT will completely erase all
old files, and avoid this warning.
</PRE>

<P>
It seems a pity to forego the power of CHKDSK /F:3 in the future.
As is suggested, FORMAT /L (for "Long", I presume) will completely
obliterate the prior partition's contents, but you can't specify this
during a reinstall. To perform it you need to use FORMAT /L on the
partition before reinstalling. For this to be practical you will
probably need to keep OS/2, and nothing else, on a separate partition and
to have a recent tape backup of the remaining volumes' contents. Note:
in my opinion keeping OS/2 on a separate partition is the best way of
laying out a system, but make sure you leave enough room for things like
extra PostScript fonts and programs that insist on putting things on C:.

<P>
<H2>Capacity</H2>

<P>
Figure 1 shows a table comparing the capacity of OS/2's FAT and HPFS
file systems. The difference in the logical drive numbers arises due to
A: and B: being assigned to floppies, which are always FAT. It would
be ridiculous to put a complex, relatively large file system, which was
designed to overcome FAT's limitations with big partitions, on volumes
as small as current FDs.

<PRE>
                          FAT            HPFS

  Logical drives          26             24
  Num of Partitions       16             16
  Max Partition Size      2 GB           64 GB
  Max File Size           2 GB           2 GB
  Sector Size             512 bytes      512 bytes
  Cluster/Block Size      0.5 KB-32 KB   512 bytes
</PRE>

<FONT SIZE=2>
Fig. 1: Comparing the capacity of FAT and HPFS
</FONT>

<P>
The next point of interest is the much greater partition size supported by HPFS.
HPFS has a maximum possible partition size of about 2,200 GB (2^32 sectors of
512 bytes each, i.e. 2 TB) but is restricted in the current implementation to
64 GB. (Note: older references state that the maximum is 512 GB.) I don't know
what imposes this limitation. Note: the effective limitation on partition size
is currently around 8 GB. This is due to CHKDSK's inability to handle a larger
partition. I presume this limitation will be rectified soon, as ultra-large HDs
will become common in the next year or two.

<P>
The 2 GB maximum filesize limit is common to DOS, OS/2 and 32-bit Unix. A
32-bit file size should be able to span a range of 4 GB (2^32) but the
DosSetFilePtr API function requires that the highest bit be used for indicating
sign (forward or backward direction of movement), leaving 31 bits for size.

<P>
The cluster size on a 1.4 MB FD is 512 bytes. For a 100 MB HD formatted
with FAT it is 2 KB. Due to the relatively small 64K (2^16) limit on
cluster numbering, as FAT partitions get bigger the size of clusters
must also increase. So for a 1-2 GB partition you end up with whopping
32 KB clusters. Since the average wastage of HD space due to the
cluster size is half a cluster per file, storing 10,000 files on such a
partition will typically waste 160 MB (10,000 * 32 KB / 2).

<P>
HPFS has no such limitation. File space is allocated in sector-sized
blocks, unlike the FAT system. A FNODE sector is also always associated
with each file. So for 10,000 files, the wastage due to sector size is
typically 2.5 MB (10,000 * 512 / 2) for the files themselves + 5 MB
consumed by the files' FNODEs = 7.5 MB. And this overhead is constant
whether the HPFS partition is 10 MB or 100 GB.

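<P>
The arithmetic in the last two paragraphs is easy to reproduce. The short
REXX sketch below follows the same reasoning (half a cluster of slack per
file under FAT; half a sector of slack plus one 512-byte FNODE per file
under HPFS); the file count and cluster size are just the example values
used above.

<PRE>
/* Rough slack-space comparison for n files, following the article's  */
/* arithmetic.  Example values: 10,000 files on a 1-2 GB FAT volume.  */
n         = 10000                   /* number of files                */
clusterKB = 32                      /* FAT cluster size in KB         */

fatWasteKB  = n * clusterKB / 2              /* half a cluster/file   */
hpfsWasteKB = n * 0.5 / 2 + n * 0.5          /* half a sector + FNODE */

SAY 'FAT slack space        :' fatWasteKB / 1000 'MB'
SAY 'HPFS slack plus FNODEs :' hpfsWasteKB / 1000 'MB'
</PRE>
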
<P>
This must be balanced against the diskspace consumed by HPFS. Since
HPFS is a sophisticated file system that is designed to accomplish a lot
more than FAT, it correspondingly requires more diskspace than FAT.
Figure 2 illustrates this. You may think that 10 MB for the file system
is too much for a 1,000 MB partition but you should consider this as a
percentage.

<PRE>
             System Usage including  Disk Space available  Allocation Unit
             MBR track               to user               + Fnode for HPFS
             FAT/HPFS in KB          FAT/HPFS in %         FAT/HPFS in KB

   10 MB     44/415                  99.57/95.95           4/0.5+0.5

  100 MB     76/3,195                99.77/96.88           2/0.5+0.5

 1000 MB     289(est)/10,430         99.98(est)/98.98      16/0.5+0.5
</PRE>

<FONT SIZE=2>
Fig. 2: Space used by FAT and HPFS on different volumes
</FONT>

<P>
Furthermore, once cluster size wastage is also considered, then the
break-even point (as regards diskspace) for a 1,000 MB partition is
about 2,200 files, which isn't very many files. This is based on a 16 KB
cluster size. In the 1,024-2,047 MB partition size range the cluster
size increases to 32 KB so the "crossover" point shifts to only 1,100
files.

<P>
I had to calculate the 1,000 MB FAT partition values since OS/2 wouldn't
let me have a FAT partition situated in the greater than Cyl 1023
region. The 4 KB cluster size of the 10 MB partition is not a misprint.
Below 16 MB, a 12-bit FAT scheme (1.5 bytes in the FAT representing 1
cluster) is used instead of a 16-bit one.

<P>
<H2>Directory Search Speed</H2>

<P>
Consider an extreme case: a FAT system on a full partition which has a
maximum-sized FAT (64K entries - this is the maximum number of files a
FAT disk can hold). The size of such a partition would be 128 MB, 256
MB, 512 MB, 1 GB or 2 GB, depending on cluster size. Each FAT is 128 KB
in size. (There is a second FAT which mirrors the first.) In this
example all the files are in one subdirectory. This can't be the
root directory because it only has space for 512 entries. (With HPFS
you can have as many files as you want in the root directory.) 64 K of
entries in a FAT directory requires 2 MB of diskspace (64K * 32
bytes/directory entry). To find a file, on average, 32 K directory
entries would need to be searched. To say that a file was not on the
disk, the full 64 K entries must be scanned before the "File not found"
message was shown. The same figures would apply if you were using a
file-finding utility to look for a file in 1,024 directories, each
containing 63 files (the subdirectory entry also consumes space).

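<P>
As a quick sanity check on those figures, here is a three-line REXX
calculation of the directory size and of the average and worst-case number of
32-byte entries examined in a linear FAT directory scan. The 65,536 entries
are the maximum from the example above.

<PRE>
/* Linear scan cost of a maximally full FAT directory */
entries = 65536                                  /* 64K entries       */
SAY 'Directory size  :' entries * 32 / 1024 'KB'
SAY 'Average search  :' entries % 2 'entries examined'
SAY 'Not-found search:' entries 'entries examined'
</PRE>
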
<P>
If the directory entries were always sorted, the situation would greatly
improve. Assuming you had a quick means of getting to the right place in the
sorted sequence, if it's the file you're looking for then you've found
its directory entry (and thus its starting cluster's address). If a
file greater in the sequence than the required file is found instead
then you immediately know that the file does not exist.

<P>
HPFS stores directory files in a balanced multi-branch tree structure
(B-tree) which is always sorted due to the way the branches are
assigned. This can lead to some extra HD activity, caused by adjustment
of the tree structure, when a new file is added or a file is renamed.
This is done to keep the tree balanced, i.e. the path length from the
root to each leaf is the same. The extra work when
writing to the disk is hidden from the user by the use of "lazy writes"
(delayed write caching).

<P>
HPFS directory entries are stored in contiguous directory blocks of four
sectors, i.e. 2 KB, known as DIRBLKs. A lot of information is stored in
each variable-length (unlike FAT) file entry in a DIRBLK structure,
namely:

<UL>
<LI>The length of the entry;
<LI>File attributes;
<LI>A pointer to the HPFS structure (FNODE; usually just before the
first sector of a file) that describes the sector disposition of the
file;
<LI>Three different date/time stamps (Created, Last Accessed, Last
Modified);
<LI>Usage count. Although mentioned in the 1989 document, this does not
appear to have been implemented;
<LI>The length of the name (up to 254 characters);
<LI>A B-tree pointer to the next level of the tree structure if there
are any further levels. The pointer will be to another directory
block if the directory entries are too numerous to fit in one 2 KB
block.
</UL>

<P>
At the end of the sector there is extra ("flex") space available for
special purposes.

<P>
If the average size of the filenames is 10-13 characters, then a
directory block can store 44 of them (11 entries/sector). A two-level
B-tree arrangement can store 1,980 entries (1 * 44-entry directory root
block + 44 directory leaf blocks * 44 entries/block) while a three-level
structure could accommodate 87,164 files (the number of files in the
two-level tree + 1,936 third-level directory leaf blocks * 44
entries/block). So the 64 K of directory entries in our example can be
searched in a maximum of 3 "hits" (disk accesses). The term "maximum"
is used because it depends on at what level the filename in question is
stored in the B-tree structure and what's in the disk cache.

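<P>
Those capacity figures follow directly from the 44-entries-per-DIRBLK
assumption, and the number of levels is also the maximum number of disk "hits"
needed to locate an entry. The REXX sketch below simply recomputes the
capacities for one to three levels; the 44 entries per block is the article's
assumed figure for 10-13 character names.

<PRE>
/* Capacity of a balanced directory tree with 44 entries per DIRBLK.  */
perBlk = 44
blocks = 1                          /* the root DIRBLK                */
total  = 0

DO level = 1 TO 3
   total  = total + blocks * perBlk
   SAY level'-level tree holds' total 'entries'
   blocks = blocks * perBlk         /* each entry can point one level down */
END
</PRE>
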
<P>
Adding files to a directory containing many files (say 500+) under FAT
becomes an exasperating affair. I've often experienced this because a
DOS program we've installed on hundreds of our customers' machines has
648 files in a sub-sub-subdirectory. Watching the archive unpack on a
machine without disk caching is bad news, and it still slows down
noticeably on machines with large SMARTDRIVE caches.

<P>
Figure 3 shows a simple REXX program you can create to investigate this
phenomenon, while Figure 4 tables some results. The program creates a
large number of zero-length files in a directory. Perform this test in
a subdirectory to overcome FAT's restriction on a maximum of 512 entries
in the root directory. Reformatting and rebooting were performed before
each test to ensure consistent conditions. With both FAT and HPFS, a
1,536 KB lazy-writing cache with a maximum cacheable read/write size of
8 KB was used. Note 1: with HPFS, a "zero-length" file consumes
diskspace because there is always a FNODE sector associated with a
file/directory, regardless of the file's contents. So 1,000 empty files
consume 500 KB of space. Note 2: there is a timing slop of about 0.1
seconds due to the 55 msec timer tick uncertainty affecting both the
start time and stop time values.

<PRE>
/* Create or open a large number of empty files in a directory */
CALL Time 'R'                            /* Reset timer */

DO x = 1 TO 1000
   CALL STREAM 'file'||x, 'c', 'open'    /* Will create if not exist */
   CALL STREAM 'file'||x, 'c', 'close'
END

SAY Time('E')                            /* Report elapsed time */
</PRE>

<FONT SIZE=2>
Fig 3: A REXX program to assess the directory searching and file
creation speeds of FAT and HPFS.
</FONT>

<PRE>
                    Number of Files in a Directory

            125    250    500    1000   2000    4000    4001->4100

FAT         1.7    3.4    8.0    23.4   99.4    468.4   26.6
FAT (LW)    0.7    1.7    5.1    17.9   89.6    447.3   26.1

HPFS        7.4    14.7   30.7   62.9   129.0   262.6   7.5
HPFS (LW)   0.5    1.0    2.2    4.5    9.0     18.3    0.5
</PRE>

<FONT SIZE=2>
Fig 4: Timing results (in seconds) of the program in Figure 3. The beneficial
effect of lazy writing on performance is clearly demonstrated. Tests were
performed in an initially empty subdirectory except for the last one,
which adds 100 new files to a subdirectory already containing 4,000
files.
</FONT>

<P>
To investigate further, the full data set was plotted on a graph with
logarithmic axes. Examine Figure 5. As you can see, HPFS' performance
is reasonably linear (in y = a*x^b + c, b was actually 1.1) while FAT's
performance appears to follow a third-order polynomial (y = a*x^3 +
b*x^2 + c*x + d). It is apparent that FAT's write caching becomes less
effective when many files are in a directory, presumably because much
time is being spent sifting through the FAT in memory. (Disk access was
only occurring briefly about once a second, based on the flashing of the
HD light.) HPFS' performance was dramatically improved in this test by
the use of write caching. Again, disk access was about once a second
(due to CACHE's /MAXAGE:1000 parameter). While, typically, most disk
access will involve reading rather than writing, this graph shows how
effective lazy writing is at hiding the extra work from the user. It is
also apparent that HPFS handles large numbers of files well. We now
turn to examining how this improvement is achieved.

<P>
<A HREF="fig5.gif">
<IMG WIDTH=100 HEIGHT=57 SRC="fig5_small.gif"></A>

<P>
<FONT SIZE=2>
Fig. 5: Log-log graph comparing file system performance creating test
files in a subdirectory. Extra data points shown. Number of files was
increased using a cube-root-of-2 multiple. (Click for large version.)
</FONT>

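<P>
For readers who want to reproduce the cache settings used in these tests, the
HPFS cache is configured on the IFS line in CONFIG.SYS, and lazy writing is
controlled by CACHE.EXE (the FAT cache is set separately with the DISKCACHE
statement). The lines below are a sketch of the sort of settings described
above (a 1,536 KB cache, an 8 KB maximum cacheable record size and a 1,000
msec /MAXAGE); the paths shown are typical defaults rather than values taken
from the article.

<PRE>
REM HPFS cache size (KB), max cacheable record size (KB), drives to autocheck
IFS=C:\OS2\HPFS.IFS /CACHE:1536 /CRECL:8 /AUTOCHECK:C

REM Enable lazy writing; a dirty cache block is written out within 1,000 msec
RUN=C:\OS2\CACHE.EXE /LAZY:ON /MAXAGE:1000
</PRE>
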
<P>
<H2>Directory Location and Fragmentation</H2>

<P>
Subdirectories on a FAT disk are usually splattered all around it.
Similarly, entries in a subdirectory may not all be in contiguous
sectors on the disk. Searching a FAT system's directory structure can
involve a large amount of HD seeking back and forth, i.e. more time.
Sure, you can use a defragger option to move all the directories to the
front of the disk, but this usually takes a lot of time to reshuffle
everything, and the next time you create a new subdirectory or add files
to an existing subdirectory there will be no free space up the front, so
directory separation and fragmentation will occur again.

<P>
HPFS takes a much better approach. On typical partitions (i.e. not
very small ones) a directory band, containing many DIRBLKs, is placed at
or near the seek centre (half the maximum cylinder number). On a 100 MB
test partition the directory band starts at Cyl 48 (counting from 0) of
a volume that spans 100 cylinders. Here 1,980 contiguous Directory
sectors (just under 1 MB) were situated. Assuming 11 entries per
Directory sector (44 entries per DIRBLK), this means that the first
21,780 directory entries will be next to each other. So if a blind file
search needs to be performed this can be done with just 1 or 2 long disk
reads (assuming <20,000 files and a 1-2 MB disk cache). The maximum
size of the contiguous directory band appears to be 8,000 KB for about
176,000 entries with 13-character names. Once the directory band is
completely full, new Directory sectors are scattered throughout the
partition but still in four-sector DIRBLKs.

<P>
Another important aspect of HPFS' directory band is its location. By
being situated near the seek centre rather than at the very beginning
(as in FAT), the average distance that the heads must traverse, when
moving between files and directories, is halved. Of course, with lazy
writing, traversals to frequently update a directory entry while writing
to a temporary file would be much reduced anyway.

<P>
<H2>File Location and Fragmentation</H2>

<P>
HPFS expends a lot of effort to keep a file either in one piece if
possible or otherwise within a minimum number of pieces and close
together on the disk so it can be retrieved in the minimum number of
reads (remembering also that cache read-ahead can take in more than one
nearby piece in the same read). Also, the seek distance, and hence time
required to access extra pieces, is kept to an absolute minimum. The
main design philosophy of HPFS is that mechanical head movement is a
very time-consuming operation in CPU terms. So it is worthwhile doing
more work looking for a good spot on the disk to place the file. There
are many aspects to this and I'm sure there are plenty of nuances of
which I'm ignorant.

<P>
Files are stored in 8 MB contiguous runs of sectors known as data bands.
Each data band has a four-sector (2 KB) freespace bitmap situated at
either the band's beginning or end. (A four-sector bitmap holds 4 * 512 * 8 =
16,384 bits, one for each of the 16,384 sectors in an 8 MB band.)
Consecutive data bands have tail-to-head placement of the freespace
bitmaps so that the maximum contiguous filespace is 16 MB (actually
16,380 KB due to the presence of the bitmaps within the adjoining band).
See Figure 6.

<P>
<IMG WIDTH=403 HEIGHT=213 SRC="fig6.gif">

<P>
<FONT SIZE=2>
Fig. 6: The basic data layout of an HPFS volume
</FONT>

<P>
Near the start of the partition there is a list of the sectors where
each of the freespace bitmaps commences. I'm sure that this small list
would be kept loaded in memory for performance reasons. Having two
small back-to-back bitmaps adjoining a combined 16 MB data band is
advantageous when HPFS is looking for the size of each freespace region
within bands, prior to allocating a large file. But it does mean that a
fair number of seeks to different bitmaps might need to be performed on
a well-filled disk, in search of a contiguous space. Or perhaps these
bitmaps are also kept memory resident if the disk is not too big.

<P>
A 2 GB file would be split into approximately 128 chunks of 16 MB, but
these chunks are right after each other (allowing for the presence of
the intervening 4 KB of back-to-back freespace bitmaps). So to refer to
this file as "fragmented", while technically correct, would be
misleading.

<P>
As mentioned earlier, every file has an associated FNODE, usually right
before the start of the file. The pieces a file is stored in
are referred to as extents. A "zero-length" file has 0 extents; a
contiguous file has 1 extent; a file of 2-8 extents is "nearly"
contiguous (the extents should be close together).

<P>
An FNODE sector contains:

<UL>
<LI>The real filename length;
<LI>The first 15 characters of the filename;
<LI>A pointer to the directory LSN that contains this file;
<LI>EAs (Extended Attributes), which are completely stored within the FNODE
structure if the total of the EAs is 145 bytes or less;
<LI>0-8 contiguous sector runs (extents), organised as eight LSN
run-starting-points (dword), run lengths (dword) and offsets into
the file (dword).
</UL>

<P>
A run can be up to 16 MB (back-to-back data bands) in size. If the file
is too big or more fragmented than can be described in 8 extents, then
an ALNODE (allocation block) is pointed to from the FNODE. In this case
the FNODE structure changes so that it now contains up to 12 ALNODE
pointers within the FNODE, and each ALNODE can then point to either 40
direct sector runs (extents) or to 60 further ALNODEs, and each of these
lower-level ALNODEs could point to either... and so on.

<P>
If ALNODEs are involved then a modified balanced tree structure called a
B+tree is used, with the file's FNODE forming the root of the structure.
So only a two-level B+tree would be required to completely describe a 2
GB (or smaller) file if it consists of fewer than 480 runs (12 ALNODEs *
40 direct runs described in each ALNODE). Otherwise a 3-level structure
would have no problems since it can handle up to 28,800 runs (12 ALNODEs
* 60 further ALNODEs * 40 direct runs). It's difficult to imagine a
situation where a four or higher level B+tree would ever be needed.

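<P>
The run counts quoted above generalise easily: every additional level of
ALNODEs multiplies the capacity by 60, while the bottom level always
contributes 40 direct runs per ALNODE. The small REXX sketch below works out
the maximum number of runs for B+trees of two to four levels (counting the
FNODE as the root level), using the 12, 60 and 40 figures from the text.

<PRE>
/* Maximum extents describable by an HPFS allocation B+tree of a given */
/* depth: 12 ALNODE pointers in the FNODE, then either 40 direct runs  */
/* or 60 child ALNODEs per ALNODE.                                     */
fnodePtrs  = 12
perAlnode  = 60
directRuns = 40

DO levels = 2 TO 4
   runs = fnodePtrs * (perAlnode ** (levels - 2)) * directRuns
   SAY levels'-level B+tree (FNODE as root):' runs 'runs'
END
</PRE>
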
<P>
Consider how much disk activity would be required to work out the layout
of a 2 GB file under FAT and under HPFS. With FAT the full 128 KB of
the FAT must be read to determine the file's layout. If this layout can
be kept in the cache during the file access then fine. Otherwise the
FAT would need to be reread one or more times (probably starting from
the beginning on each reread). With HPFS, up to 361 sector reads, in a
three-level B+tree structure, and possibly up to just 13 sector reads,
in a two-level structure, would provide the information. The HPFS
figures are maximums and the actual sector-read figure would most
probably be much lower since HPFS was trying hard to reduce the number
of runs when the file was written. Also if the ALNODEs are near each
other then read-ahead would reduce the actual hits. Furthermore, OS/2
will keep the file's allocation information resident in memory while the
file is open, so no rereads would be needed.

<P>
If you've ever looked at the layout of files on a HPFS partition, you
may have been shocked to see the large gaps in the disk usage. This is
FAT-coloured thinking. There are good reasons not to use the first
available spot next to an existing file, particularly in a multitasking
environment where more than one write operation can be occurring
concurrently. HPFS uses three strategies here that I'm aware of.
First, the destination of write operations involving new files will tend
not to be near (preferably in a different band from) where an existing
file is also being updated. Otherwise, fragmentation would be highly
likely to occur.

<P>
Second, 4 KB of extra space is allocated by the file system to the end
of a file when it is created. Again the reason is to reduce the
likelihood of fragmentation from other concurrent writing tasks.
If not utilised, this space is recovered afterwards. To test this
assertion, create the REXX cmdfile shown in Figure 7 and run it on an
empty HPFS partition. (You can also do this on a partition with files
in it but it is easier on an empty one.) Run it and when the "Press any
key" message appears start up another OS/2 session and run CHKDSK (no
switches) on the partition under examination. CHKDSK will get confused
about the space allotted to the file open in the other session and will
say it is correcting an allocation error (which it really isn't doing
because you did not use the /F switch). Ignore this and notice that "4
kilobytes are in 1 user files". Switch back to the other session and
press Enter to close the file. Repeat and again run CHKDSK in the other
session. Notice this time that no extra space is allocated since the
file is being reopened rather than being created.

<PRE>
/* Test to check the space
   preallocated to an open file */

CALL STREAM 'zerofile', 'c', 'open'
/* Will create if it does not exist */
'@pause'
CALL STREAM 'zerofile', 'c', 'close'
</PRE>

<FONT SIZE=2>
Fig. 7: A simple REXX program to demonstrate how HPFS allocates 4 KB of
diskspace to a new file.
</FONT>

<P>
Third, if a program has been written to report the likely filesize to
OS/2, or if you are copying an existing file (i.e. the final filesize
is known), then HPFS will expend a great deal of effort to find a free
space big enough to accommodate the file in one extent. If that is not
possible then it looks for two free spaces half the size of the file, and
so on. Again this can result in two files in a directory not being next
to each other on the disk.

<P>
Since DOS and Windows programs are not written with preallocation space
requesting in mind, they tend to be more likely candidates for
fragmentation than properly written OS/2 programs. So, for example,
using a DOS comms program to download a large file will often result in
a fragmented file. Compared with FAT, though, fragmentation on heavily
used HPFS volumes is very low, usually less than 1%. We'll consider
fragmentation levels in more depth in Part 3.

<P>
<H2>Other Matters</H2>

<P>
It has also been written that the HPFS cache is smart enough to adjust
the value of its sector read-ahead for each opened file based on the
file's usage history or its type (Ray Duncan, 1989). It is claimed that
EXE files and files that typically have been fully read in the past are
given big read-aheads when next loaded. This is a fascinating concept
but unfortunately it has not been implemented.

<P>
Surprisingly, like other device drivers, HPFS is still 16-bit code. I
think this is one of the few remaining areas of 16-bit code in Warp. I
believe IBM's argument is that 32-bit code here would not help
performance much, as mechanical factors are the ones imposing the limits,
at least in typical single-user scenarios.

<P>
HPFS is run as a ring 3 task in the 80x86 processor protection mechanism,
i.e. at the application level. HPFS386 is a 32-bit version of HPFS
that comes only with IBM LAN Server Advanced. HPFS386 runs in
ring 0, i.e. at kernel level. This ensures the highest file system
performance in demanding network situations. It can also provide much
bigger caches than standard HPFS, which is limited to 2 MB. There is a
chance that this version will appear in a later release of Warp.

<P>
OS/2 v2.x onwards also boosts the performance of FAT. This improvement,
called "Super FAT", is a combination of 32-bit executable code and the
mirroring of the FAT and directory paths in RAM. This requires a fair
bit of memory. Also, Super FAT speeds the search for free space by
representing the used sectors of the FAT as an in-memory bitmap. This does
help the performance, but I think the results in Figure 4, which were
obtained using the Super FAT system, still highlight FAT's
architectural weaknesses.

<P>
You can easily tell whether a partition is formatted under HPFS or FAT. Just
run DIR in the root directory. If "." and ".." directory entries are shown
then HPFS is used [Unless the HPFS partition was formatted under Warp 4 -- Ed].

<P>
<H2>Conclusion</H2>

<P>
HPFS does require 300-400 KB of memory to implement, so it's only
suitable for OS/2 v2.1 systems with at least 12 MB or Warp systems with
at least 8 MB. For partitions of 100 MB+ it offers definite technical
advantages over FAT. By now you should have developed an understanding
of how these improvements are achieved.

<P>
In the next instalment, we look at a shareware program to visually
inspect the layout of a HPFS partition and a REXX program to dump the
contents of a disk sector by specifying either decimal LSN, hexadecimal
LSN, dword byte-order-reversed hexadecimal LSN (what you see when you
look at a dword pointer in a hex dump) or Cyl/Hd/Sec coordinates. Other
REXX programs will convert the data stored in the SuperBlock and the
SpareBlock sectors into intelligible values. You should find it quite
informative.
|