Making Finding Aids for the WWW

..or Adventures in extracting HTML from Microsoft Access

3. Exporting your files to databases

So, you want to get your word-processed finding aids into an Access database, how do you start? First, ask yourself this question - 'What types of information does this finding aid contain?', or perhaps more importantly, 'To which entities does this information refer?'. For example, you may be able to identify the following components within your finding aid:
  • Historical note
  • Series numbers
  • Series titles
  • File numbers
  • File titles
  • File descriptions
  • File date ranges
It's not hard to see that their are three entities referred to: the creator of the records, described in the 'Historical note'; the series; and the files. Data on each of these entities will need to be stored in separate tables, and so will have to be imported separately. (The Tabularium WWW site provides more detailed notes on 'entities'.) Note, however, that there are likely to be links between these tables, for example you will probably want to store a 'Series number' in the 'Files' table to show to which series a particular file belongs.

Each of the categories identified in the example above will correspond to fields within the tables, but you will have to decide how much to want to break up your data. The 'File date range' might be separated out into 'Start date' and 'End date'. Obviously, the more you break things down the more fine-grained will be your control over the data, but the more work you'll have to do to get it into the database. I'm not going to go into the design of databases here as that's a whole new topic. I'm just going to take it for granted that you have an appropriate database structure set up. You might want to look at Joanne Evan's paper on the structure of ADS, and also at the Tabularium manual which provides some notes on importing your data.

Some information might be able to be cut-and-pasted directly from your word processor to an Access table, such as if you have only one provenance entry or historical note. But most of your data will be in the form of lists, with each item in the list corresponding to a record in the database table. How do you transfer whole lists across? You simply have to put your list into a format that the database can recognise. To do this you need to make it clear to the database where the boundaries are between fields, and between records. The most common way of doing this is by using a comma-delimited format - this means there will be commas between fields, text will be enclosed in quotes, and records will be separated by carriage returns. Time for an example:

In word-processing format -
File 1 - Dining habits of penguins, 1923-1978 (3cm) Box 1
File 2 - Correspondence relating to The Hungry Penguin, 1956-1960 Box 1

In comma-delimited format -
"1","Dining habits of penguins", "1923", "1978","3","1"
"2", "Correspondence relating to <I>The Hungry Penguin</I>","1956", "1960", "", "1"

There are a few things to notice here. First, there's no need to transfer repetitive information such as "File" or "Box", you can get the database to insert such labels as you require them. You'll notice that the second file doesn't have any quantity information, nonetheless there is an empty field in the comma-delimited version where the quantity would have been. When putting text into comma-delimited form there are two very important rules to keep in mind:

  1. Each record should have the same number of fields (even if some are empty)
  2. The fields for each record should be in the same order
If you don't follow these rules the database will get confused when you try and import the file and spit errors at you.

Another thing to note in the above example is the treatment of italics. In general, if you want italics to appear in your HTML finding aid, you will need to insert HTML tags (<I> and </I>) around the italicised words. The same goes for other types of character formatting, such as bold. It's not difficult, in fact I've created a Word macro that will do it all for you. Notice that although the box numbers in the example were italicised, there are no HTML tags around them in the comma-delimited version. This is because the formatting is being applied to the whole field, not just a part of it. You can do this from within the database. Of course, the use of HTML tags within fields contradicts my argument about keeping the data and the markup separate, but the problem comes about because Access doesn't allow you to italicise data within fields (as opposed to formatting a whole field as italics). So if you want to preserve your formatting you have to have some way of indicating italics, and the HTML tags are as a good a way as any. If you wanted to go the other way and export your data from a database into a Word document, you could run another macro to remove the tags and italicise the enclosed words.

Similarly, you might want to insert HTML paragraph tags (<P>) in text fields to preserve your paragraph breaks. Here, however, you have a choice, as Access will preserve your para breaks, and you can therefore translate them to HTML on the fly and avoid having to hard-code them (you can look at some code I use to do this).

One more important rule! Because double quotes are used to enclose text, they can't be used within text fields. This means that you should run a replace operation to change all the double quotes into single quotes. If you don't, you'll have lots of errors to deal with when you try to import the file into Access.

Going through the whole document inserting quotes and commas all over the place would be a very time-consuming business, and would probably produce lots of errors. It's usually not too difficult to automate the process using macros and find-and-replace operations. Not surprisingly, the complexity of the operation will depend on the quality and consistency of the original listing. The way that I have approached such tasks is first to examine the formatting to find some way of uniquely identifying breaks between records - this may be something as simple as two paragraph breaks in a row. Then I look for ways of identifying the boundaries between fields - in the above example the combination "number - space - hyphen- space" identifies the boundary between the file number and the file title. Labels such as "File" and "Box" may also provide useful markers. As long as these markers are consistent throughout the listing they can be used to position the commas, quotes and carriage returns you need to convert it into a comma-delimited file. I find it easiest to have the macro select each record in turn and then open it a new document. This makes it easier to operate on. Then once you've processed it, you can close the file and paste its contents into the comma-delimited file.

Things start to get complicated if there are different numbers of fields for different records. If this is the case you need some way of working out which fields are missing and then inserting an empty field marker in the appropriate location. It sounds hard, but it can be done. The Directory of Archives in Australia was a case like this. I received the contents as a series of Word files, formatted as it appears in the printed version. You'll notice that the amount and categories of information for each repository vary considerably. However, because the listing was basically consistent in the way it used its categories I could check each record to see what fields were used and insert empty fields where appropriate. There were various other tricky things I had to do with the Directory data as well so that the lists of holdings and publications could be stored in a separate but linked database table, but I won't go into the details. Other resources that I have converted from word processing files into databases and then to HTML include:

As you can see, they're a real mixed bag. It gives an idea of the flexibility of the technique.

Once you have all your data in a comma-delimited format you then need to save the file as plain text (ASCII). Then it's just a matter of importing the file into Access. The procedure for doing this varies between Access versions, just check the help files, it's pretty straightforward. In all likelihood you'll get a few errors, usually things like renegade quotation marks. The error should tell you the line it occurred on, so then you can just open up the comma-delimited text file again, go to the appropriate line and see what's going on. Eventually you'll track down all the errors and all your data will be in the database!

Next Section - 4. Producing HTML from databases

1. Introduction
2. Why use databases?
3. Exporting your files to databases
4. Producing HTML from databases
4.1 Export to rtf method
4.2 The module method
4.2.1 Contents page
4.2.2 Item listings
4.2.3 The results

Screen capture of ASA Directory
Directory of Archives in Australia.
Converted from Word to Access!

Some Word macros
you may find useful:

View code:


Created by Tim Sherratt (Tim.Sherratt@asap.unimelb.edu.au)
Last modified: 16 March 1998