Escaping the UNIX Tar Pit
Producing CD-ROMs in the UNIX Environment©
Authored & Published in January of 1991
Stan J. Caterbone
Director of CD-ROM Technologies for American Helix Technology
Director of Advanced Media Group, Ltd.
1857 Colonial Village Lane
Lancaster, PA 17601.
Phone: (800) 525-6575
Fax: (717) 392-7897
John S. Garofolo
National Institute of Standards and Technology
Gaithersburg, MD 20899
Phone: (301) 975-3193
UNIX is a trademark of American Telephone and Telegraph, Inc. (AT&T).
Disclaimer: Certain trade names and company products are mentioned in the text in order to adequately specify procedures and equipment used. In no case does such identification imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the products are necessarily the best available for the purpose.
Just when things are going smoothly, and we begin to feel a little too comfortable and too confident with CD-ROM technology, someone or something puts us in our place -- and thankfully so. It's these challenges that facilitate our progress toward broadening the horizons of CD-ROM technologies.
This article is intended to inform publishers and manufacturers of the problems that can be encountered in using UNIX tar-formatted files as a medium of data submission for CD-ROM production, and of some of the issues confronting the next generation of CD-ROM publishers. Databases that are developed on non-DOS-based systems and have performance requirements exceeding MS-DOS capabilities are becoming more commonplace. Ironically, the existing CD-ROM production infrastructure has been created and supported primarily by DOS-based systems. Although we are making progress in publishing data on other platforms, a large majority of the CD-ROMs published today are still designed on DOS machines for use on DOS machines. The current tendency to link CD-ROM with DOS makes it difficult to implement CD-ROM technology on non-DOS systems and is therefore slowing its widespread acceptance.
DOS is a trademark of the International Business Machines Corporation (IBM) and MS-DOS is a trademark of the Microsoft Corporation.
The ensuing paragraphs illustrate the need for the CD-ROM industry to become more in tune with the trends that are shaping information technologies. CD-ROM, which is one such information technology, is beginning to recruit a new breed of both users and publishers, who are hoping that CD-ROM will adapt to them, as opposed to them having to adapt to it. The Automated Speech Recognition Group of the National Institute of Standards and Technology (NIST) is one such CD-ROM publisher.
The NIST Automated Speech Recognition Group
Sponsored in part by the Defense Advanced Research Projects Agency Information Science and Technology Office (DARPA-ISTO), the group designs and implements methods of performance evaluation for spoken language systems. These systems consist of natural language understanding as well as speech recognition components. Additionally, it distributes databases, or corpora, of speech recordings as standard reference material for the development and evaluation of these systems.
Traditionally, these speech corpora have been recorded and stored in a digital form rather than in an analog audio format. This allows the data to be easily loaded, stored, and manipulated in computers and prevents signal degradation in copies. The speech is digitized at a sampling rate of between 10 and 20 kHz, as opposed to the 44.1 kHz sampling rate used in CD-audio. Digitizing speech at these sampling frequencies keeps intact the properties of the speech signal that are important for automatic speech recognition while minimizing storage requirements. These corpora typically consist of thousands of spoken phrases or sentences which are stored in separate files for ease of computer manipulation.
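The storage saving can be estimated with a little arithmetic. The sketch below is illustrative only: it assumes 16-bit samples, a single channel for the speech recordings, and two channels for CD-audio. Of these, only the sampling rates come from the text above.

```python
def bytes_per_second(sample_rate_hz, bytes_per_sample=2, channels=1):
    """Uncompressed PCM data rate in bytes per second."""
    return sample_rate_hz * bytes_per_sample * channels

# Assumed formats: 16-bit mono speech versus 16-bit stereo CD-audio.
speech_16k = bytes_per_second(16_000)
cd_audio = bytes_per_second(44_100, channels=2)

# How many times more storage the CD-audio format needs per second.
ratio = cd_audio / speech_16k
```

Under these assumptions, a second of speech sampled at 16 kHz needs roughly a fifth to a sixth of the storage that a second of CD-audio requires.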
In the mid-1980s, NIST began an archival/lending library for public domain speech corpora. The corpora were originally maintained and distributed on half-inch reel-to-reel digital magnetic computer tapes. Initially, these corpora were small, but as recognition systems became more sophisticated, their appetite for "training" data grew tremendously. By the end of the decade these corpora were each occupying 50 or more 6250-bpi half-inch magnetic tapes, and even larger databases were on the horizon. Managing these colossal databases of speech had become a real problem. Simply storing, copying, and distributing the corpora had become unwieldy. Furthermore, maintaining the integrity of the corpora was even more difficult, as tapes were frequently damaged in shipment or by rogue tape drives.
NIST and CD-ROM
By early 1988, the NIST Automated Speech Recognition Group had begun investigating optical disk storage technologies as a means of replacing its tape archives. Initially, Write-Once Read-Many (WORM) technology was considered for use as a universal distribution medium but was found to lack adequate standardization. Fortunately, in the spring of 1988, the ISO-9660 file format standard for CD-ROM was adopted and CD-ROM was chosen by NIST as a new "experimental" medium for distributing speech corpora.
NIST decided that the first corpus to be produced on CD-ROM would be the DARPA "TIMIT" Acoustic-Phonetic Continuous Speech Corpus. Under DARPA sponsorship, TIMIT was jointly designed, recorded, transcribed, and archived by Texas Instruments (TI), the Massachusetts Institute of Technology (MIT), SRI International, and the National Bureau of Standards (now NIST). The TIMIT corpus was designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems. The corpus contains recordings of 630 speakers from 8 major dialect divisions of American English, each speaking 10 phonetically-rich sentences. In addition to standard orthographic (text) transcriptions, TIMIT contains unique time-aligned phonetic transcriptions.
NIST felt that TIMIT's unique structure would be of great interest to speech researchers and, therefore, would probably be ideal for widespread publication on CD-ROM. NIST decided to publish two-thirds of the corpus on a "prototype" CD-ROM. Because of the ISO-9660 restrictions on filename length and format, the chosen two-thirds of the corpus to be placed on CD-ROM was restructured from a flat directory structure with lengthy unique UNIX filenames into a dense 5-level directory hierarchy, which reflected the design of the corpus and conformed to ISO-9660. The resulting directory structure contained 4200 bottom-level subdirectories -- one for each sentence-utterance, and 3 files per utterance for a total of 12,600 data files! This new organization required the use of the entire path and filename to uniquely identify a file but was "visually navigable."
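The naming restrictions that forced this restructuring can be checked mechanically. The following sketch tests a path against a simplified reading of ISO-9660 Level 1 rules: names of up to eight upper-case letters, digits, or underscores, an optional extension of up to three characters, and no more than eight directory levels counting the root. It ignores file version suffixes, and the example paths are hypothetical; they are not the actual TIMIT names.

```python
import re

# ISO-9660 Level 1 "d-characters": upper-case letters, digits, underscore.
_DIR_RE = re.compile(r"^[A-Z0-9_]{1,8}$")
_FILE_RE = re.compile(r"^[A-Z0-9_]{1,8}(\.[A-Z0-9_]{0,3})?$")
MAX_LEVELS = 8  # directory levels, counting the root


def is_level1_path(path):
    """Return True if every component of `path` fits the simplified rules."""
    parts = path.strip("/").split("/")
    *dirs, filename = parts
    if len(dirs) + 1 > MAX_LEVELS:  # +1 counts the root directory
        return False
    return all(_DIR_RE.match(d) for d in dirs) and bool(_FILE_RE.match(filename))


# Hypothetical names: a long flat UNIX-style filename and a hierarchical one.
flat_ok = is_level1_path("train-dr1-speaker01-sa1.wav")
nested_ok = is_level1_path("TIMIT/TRAIN/DR1/SPKR01/SA1/SA1.WAV")
```

A long, flat UNIX filename fails the check, while the same information spread across short directory names passes, which is the trade-off described above.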
To date, more than 200 "TIMIT Prototype" discs have been distributed to universities and speech research laboratories worldwide. The discs were well received by the speech research community and have been read on PCs, Macintoshes, various UNIX systems, NeXT machines, and MicroVAXes. The "experiment" had proved to be successful.
As of this writing, NIST has produced four releases of speech corpora on eight discs. Recently, NIST completed production of its most ambitious speech disc so far. The new disc is a complete revision of the TIMIT Prototype disc and contains the speech for the complete 630-speaker corpus as well as all-new time-aligned word-boundary transcriptions. The new TIMIT CD-ROM contains 25,200 data files (4 files per utterance) as well as more extensive documentation and software utilities.
After the production of the TIMIT prototype disc, NIST recognized the need to distribute speech corpora in a consistent format. Unfortunately, no standard file format existed for storing and exchanging speech signals. Compounding this problem, almost every speech research laboratory around the world used different hardware and software configurations for speech signal processing and analysis.
Macintosh is a trademark of Apple Computer, Inc. NeXT is a trademark of NeXT, Inc.
A UNIX-Based CD-ROM Preparation Workstation
In order to implement a full scale CD-ROM production effort, the Automated Speech Recognition Group built a UNIX-based CD-ROM publishing workstation, which also doubles as a general-purpose speech research system. CD-ROM images are prepared on a Sun Microsystems server system with 32 megabytes of main memory, 3 gigabytes of high-speed magnetic disc storage, a 9-track tape drive, an 8mm tape drive, and of course a CD-ROM drive. The workstation contains two 1.2 gigabyte magnetic disc drives on which entire CD-ROM images can be assembled and simulated.
Each CD-ROM is now organized entirely in the UNIX environment. Many of the standard UNIX utilities and capabilities have proven ideal tools for CD-ROM preparation. Tar files are now submitted for CD-ROM replication on one 8mm tape, instead of 5 or 6 half-inch reel-to-reel tapes.
NIST plans to add UNIX-based CD-ROM premastering software in the near future to help alleviate some of the complications it has experienced in submitting data for replication. By performing ISO-9660 formatting in house, an ISO-9660 image can be submitted to the replication facility. The ISO-9660 image can then be loaded directly into a mastering system, thus circumventing the problems that can occur when downloading tar-formatted files.
NIST has developed strategies to maximize the portability of its CD-ROMs by organizing speech data into a consistent format and providing utilities which can be linked into each laboratory's unique hardware and software systems. To accomplish this, a flexible, object-oriented header structure was developed for the exchange of speech files, especially on CD-ROM. The header is an ASCII-based structure prepended to each speech file. It allows an utterance to be uniquely identified (even if the file is copied from CD-ROM and inadvertently renamed) and describes basic attributes of the speech signal to aid in digital-to-analog operations. A set of software utilities, "Speech Header Resources" (SPHERE), has been written to provide a low-level interface for importing and manipulating these files. NIST now publishes all speech data in this more consistent format.
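The idea of an ASCII header prepended to the sample data can be sketched in a few lines. This is a generic illustration and not the SPHERE layout: the field names, the `end_head` terminator, and the fixed 1024-byte header size are all assumptions made for the example.

```python
HEADER_SIZE = 1024  # assumed fixed header length in bytes (hypothetical)


def make_header(fields):
    """Encode a dict of str -> str as space-padded ASCII 'key value' lines."""
    lines = [f"{key} {value}" for key, value in fields.items()] + ["end_head"]
    raw = ("\n".join(lines) + "\n").encode("ascii")
    if len(raw) > HEADER_SIZE:
        raise ValueError("header fields do not fit in the fixed-size header")
    return raw.ljust(HEADER_SIZE, b" ")


def read_header(blob):
    """Split a file image into (header fields, raw sample bytes)."""
    fields = {}
    for line in blob[:HEADER_SIZE].decode("ascii").splitlines():
        if line.strip() == "end_head":
            break
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields, blob[HEADER_SIZE:]


# Hypothetical utterance identifier and attributes.
header = make_header({"utterance_id": "dr1_spkr01_sa1", "sample_rate": "16000"})
fields, samples = read_header(header + b"\x00\x01\x02\x03")
```

Because the identifying fields travel inside the file, the utterance can still be recognized after the file has been copied and renamed.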
A Data Submission Problem
All of the key components for efficient CD-ROM production were in place at NIST, except for a vehicle for data submission. When NIST initially delved into the world of CD-ROM production, it was dismayed to learn that most CD-ROM replication facilities accepted only standard ANSI-labeled or ISO-9660 imaged tapes as transfer media. The small Automated Speech Recognition Group could not justify the expense of purchasing a special-purpose premastering workstation dedicated to creating ISO-9660 tapes. Neither could NIST provide standard ANSI-labeled tapes because the simple structure of ANSI-formatted files would not preserve the extensive directory structure required by the many files typically contained in speech corpora.
The UNIX tar Answer?
The tar-formatted tape is the standard medium of data exchange in the UNIX world, and NIST had been successfully distributing speech corpora on "tar tapes" for several years. The UNIX tar (Tape Archive) utility was designed to create a portable archive format for UNIX files. The tar program generates a single file (usually on magnetic tape) which contains all of the information necessary for reconstituting directories, files, and UNIX-specific file parameters. What distinguishes the tar utility from most other archive programs is that the archive format it creates is portable across machines and operating systems. The key to the tar format's portability is its simplicity. Tar does not employ any elaborate compression algorithms when generating an archive. It simply creates a byte-for-byte copy of each file to be archived with a prepended header block. The header block contains the path and name of the file (or directory), the file size, the time of last modification, and UNIX ownership and permission flags. Because the information in each header block, as well as the file itself, is byte-encoded, the tar file can be read by any system which can recognize a stream of bytes. Of course, binary executable files are system-specific and cannot usually be run on differing systems. But text, source code, and binary data files can be easily exchanged.
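The header-plus-data layout is easy to see with a modern tool. The sketch below uses Python's standard `tarfile` module, which is not the utility used in the production described here, to build a one-file archive in memory and read its header back. The file path, contents, and timestamp are hypothetical, and the archive is written in the module's `USTAR_FORMAT` so that the first 512-byte block is the file's own header.

```python
import io
import tarfile


def build_archive(files):
    """Return tar bytes for a dict mapping path -> file contents (bytes)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for path, data in files.items():
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            info.mtime = 662688000  # arbitrary fixed timestamp
            info.mode = 0o644       # UNIX permission flags
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_headers(archive_bytes):
    """Return (name, size, mtime, mode) for every member header."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as tar:
        return [(m.name, m.size, m.mtime, m.mode) for m in tar.getmembers()]


name = "timit/train/dr1/sa1.txt"  # hypothetical path
archive = build_archive({name: b"hello"})
headers = list_headers(archive)
```

In the resulting bytes, the path sits at the start of the header block and the file contents follow, unmodified, at offset 512.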
To date, the tar program has been ported to many operating systems, including MS-DOS and VMS as well as the many variants of UNIX. Because the tar format is portable and preserves directory hierarchy, and because a tar file can be written to a standard ANSI-labeled tape or any other storage medium, NIST concluded that tar-formatted ANSI tapes would be the ideal vehicle for providing a CD-ROM-ready file image to a replication plant. Unfortunately, NIST has found that most replication plants either refuse to accept tar-formatted files or charge considerable "data conversion" fees to download the files into their premastering systems. To say the least, the acceptance of tar as an input medium for CD-ROM production has been less than universal in the CD-ROM replication industry. The replication facilities that have ventured into the "tar pit" with NIST have frequently encountered technical delays and cost overruns. In theory, the tar-tape to CD-ROM process should be simple.
But in reality, it has rarely been straightforward to implement.
Pitfalls in Extracting a CD-ROM Image from a UNIX tar File
The challenges encountered in producing a CD-ROM from a 630-megabyte tar tape, which contains over 25,000 files, can at first seem insurmountable. Several problems have occurred during production, some of which are still not completely resolved. Downloading and extracting a CD-ROM image from a tar file can be excruciatingly slow, taking 15 or more machine hours of time for a single disc image. If a tar file is packed with thousands of files, unforeseen complications can arise in the extraction process, and diagnosing and troubleshooting all of the subsystems involved can become painful for even the most experienced of engineers and technicians.
Extracting the file structure from a tar file for a CD-ROM such as the new TIMIT disc requires a great deal of time and attention because of the extraordinary number of directories and files. The subsystems involved in the tar extraction process require seamless integration. These include the PC hardware platform and MS-DOS operating system, the premastering system, the device drivers, controller cards, tape back-up systems, and the tar utility. Limitations inherent in the MS-DOS operating system, device drivers, and file structures can cause a breakdown in any one of these subsystems, costing hours of man and machine time in the production process.
Eight-mm tape subsystems can be especially vulnerable when extracting exceedingly large numbers of files. This is because 8mm tape drives are mechanically suited to streaming operations. They do not accommodate the quick stopping and starting movements, which become necessary when extracting many thousands of small files, as well as 9-track tape drives do. Additional loss of efficiency occurs when 8mm drives must interface with a system that has become bogged down by overloaded magnetic disk subsystems. The only way to optimize their operation is to load and buffer large blocks of raw data before the data is tar-extracted. Subtle problems may also arise when the controller cards of some 8mm tape systems are not entirely compatible with the publishing system being used. These and other unforeseen problems can cause a tape drive to abort operations well before completion of the extraction process. Worse yet, because the tar format does not guarantee that directories and files are stored in any particular order, an entire tar file must be scanned to extract any subset of the files contained in it. If the tar-extraction process aborts before the end of the tar file is reached, the entire process must be restarted from the beginning to ensure that all files are loaded. These constraints require that special efforts be taken to prepare backup tapes and even second backup tapes during production. This is one area of risk where the insurance is well worth the effort and is within one's control. Many of the other pitfalls are not as easy to anticipate or avoid.
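Both points, buffering the raw data in large blocks and scanning the whole archive to find a subset, can be sketched with Python's standard `tarfile` module. This is a minimal illustration under stated assumptions, not the procedure used in production: the 1 MiB block size and the function names are hypothetical, and an in-memory stream would stand in for the tape device.

```python
import io
import tarfile

BLOCK_SIZE = 1 << 20  # assumed buffer size: 1 MiB per read


def spool(source, dest, block_size=BLOCK_SIZE):
    """Copy a raw byte stream in large blocks; return the bytes copied."""
    total = 0
    while True:
        block = source.read(block_size)
        if not block:
            break
        dest.write(block)
        total += len(block)
    return total


def extract_subset(stream, wanted):
    """Scan a tar stream front to back, keeping only the wanted members.

    Returns (contents by name, number of headers scanned). Every header is
    visited because the archive carries no index of where members are stored.
    """
    found = {}
    scanned = 0
    with tarfile.open(fileobj=stream, mode="r|") as tar:  # sequential mode
        for member in tar:
            scanned += 1
            if member.isfile() and member.name in wanted:
                found[member.name] = tar.extractfile(member).read()
    return found, scanned
```

Spooling the tape to disk first lets the drive stream continuously; the slower per-file work then runs against the disk copy, where an aborted extraction can be retried without rereading the tape.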
One of the more frustrating problems encountered while downloading the TIMIT tar file was that of the overhead created while extracting the 18,900 small transcription files. To illustrate this point, during the downloading of the 632-megabyte tar file, containing the 25,241 TIMIT files, the process aborted on 650-, 850-, and 1200-megabyte partitions due to insufficient disc space!
On UNIX systems, the size of file blocks (similar to the ISO-9660 and DOS sector structures) can be modified. Although the ISO-9660 standard supports different sector sizes, the individual operating systems used in the premastering process may present problems. For example, MS-DOS 3.31 does not allow any modifications to sector size. Fortunately, MS-DOS 4.0 is more forgiving.
The TIMIT tar file contained 18,900 transcription files of under 2 KB each. A premastering system running DOS 3.31 with a 16 KB sector size would require over 300 megabytes of disk storage for these files, which actually amount to less than 32 megabytes of data. This results in disk overhead of an order of magnitude! However, by switching to DOS 4.0, the sector size can be reduced to as little as 512 bytes. This significantly reduces the overhead incurred on the DOS partition. It is therefore important to adjust the sector size to suit the size of the database files to be downloaded. To maximize disk usage, the sector size should be set high when premastering a database with a few large textual files. But when a database (such as TIMIT) contains many small files, the sector size should be greatly reduced. Likewise, it is also important to allow for this kind of overhead on the CD-ROM itself. Although CD-ROMs are generally created with a 2 KB sector size, the sector size can be reduced on the ISO-9660 image in the premastering phase to as little as 512 bytes. By decreasing the sector size on the TIMIT ISO-9660 image to 512 bytes, potential disc overhead was reduced by about 32 megabytes.
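The overhead figures follow from rounding every file up to a whole number of allocation units (what the text calls the sector size). The sketch below reproduces the arithmetic. The uniform 1,600-byte file size is an assumption chosen only to stay under the stated 2 KB limit; the real transcription files vary in size.

```python
def allocated_bytes(file_sizes, unit):
    """Disk space consumed when each file occupies whole allocation units."""
    # -(-size // unit) is ceiling division: units needed for one file.
    return sum(-(-size // unit) * unit for size in file_sizes)


# 18,900 files at an assumed 1,600 bytes each (hypothetical uniform size).
sizes = [1600] * 18_900
actual = sum(sizes)                          # bytes of real data
at_16k = allocated_bytes(sizes, 16 * 1024)   # 16 KB allocation unit
at_512 = allocated_bytes(sizes, 512)         # 512-byte allocation unit
```

With a 16 KB unit every small file occupies a full 16 KB, so the total depends only on the file count: 18,900 x 16,384 bytes, or about 310 million bytes, matching the "over 300 megabytes" quoted above. At 512 bytes the same files occupy under 39 million bytes.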
Finally, a hidden source of potential problems lies within the implementation of the utility used to extract the tar file. A number of tar utilities have been written and are in use today. Many of these utilities are suboptimal in speed and efficiency. The time required for downloading a tar file can become critical when extracting large numbers of files. Therefore, using the right tar implementation is a must.
The Real "Tar Pit" -- Universal Operability
The real problem facing the CD-ROM industry concerning the production of non-DOS-based discs lies not in which utilities or platforms to use, but within the deeper abyss of universal operability. Universal operability encompasses the common methodology of transferring, publishing, and retrieving many different types of data across different platforms, while using different hardware and software systems. Attempting to extract a tar file into a DOS-based premastering system is a perfect example of why universal operability is the next technical challenge for the CD-ROM industry at large. If this issue continues to be ignored, entire market segments will be left paralyzed because of the inability to publish information from beginning to end without experiencing compatibility problems. This bleak scenario could result in the CD-ROM industry losing the acceptance and respect it has worked hard to gain.
The Challenge Ahead
This article has illustrated some of the potential problems that can result when using the UNIX tar format as a data submission medium for CD-ROM replication. More importantly, it has shown that a much greater variety of CD-ROM applications could blossom if the CD-ROM industry embraces a diversification of CD-ROM platforms. The ISO-9660 standard has provided a good basis for the exchange of CD-ROMs across different hardware and software platforms. It is now time for the CD-ROM industry to address and overcome the many obstacles posed by the challenge of universal operability. The increasing need for a standard media- and platform-independent format for data submission is just one such obstacle. In the short term, manufacturers of CD-ROM premastering workstations should publish specifications indicating the limitations of their systems. This would allow publishers and replicators of "atypical" CD-ROMs to avoid many of the unforeseen pitfalls they must now face. In the long term, these premastering systems must be made more robust.
The next generation of CD-ROM publishers and users will help CD-ROM technology reach new heights, but they will become far less forgiving as CD-ROM becomes more commonplace. For NIST, the UNIX road to CD-ROM has certainly been "the road less traveled." Currently, the development, production, and use of CD-ROM technology in UNIX and other environments are still in their infancy. However, by increasing support for development and production in these environments, CD-ROMs may someday be produced and used on a variety of platforms as easily as they are on MS-DOS-based systems today. It is only in this way that the CD-ROM will become the truly universal medium of data exchange that it was intended to be.
The authors wish to thank the following people, who have helped them in their quest for solutions to the problems this article has outlined: Joe Bradley and Clayton Summers at Philips and Dupont Optical Co., Dennis Clark, formerly of Meridian Data, Inc., Leon Whidbee and Gisele Venczel at Disc Manufacturing, Inc., Lance Buder and Sylvester Pefek at Optical Media International, and Tom Brown at Reflective Software.
Helgerson, L. W., "Universal Operability: The Technical Solution", Disc Magazine, pp. 36-39, October 1990.