I Prefer the Database

When I was in college, I went to dinner with a group of friends to a steakhouse. Each friend ordered a steak, well-done. I was the last to order. When asked how I would like my steak, the group egged me on to have it rare because, me being a Texan, knew how steak was meant be. I said to the waiter “well-done”, because I always had it prepared that way and was used to it.

Several months later, with the same group but different restaurant, the Cajun Blackened Steak caught my eye so I decide to order it. When asked how I wanted it cooked, I naturally said well-done. The waiter replied, “But sir, it is only prepared rare or medium rare”. I was horrified, but had my heart set on steak. Taking the least risky choice, I selected medium rare. When the steak came, I cut a small piece. Again, I was horrified. I closed my eyes and but the piece of meat into my mouth.

The meat melted in my mouth like butter. I never knew that beef had so much flavor. I have never order a steak well-done since.

To all the vegans and vegetarians reading this article, I apologize for the example, but the point of it is, even though we may be repulsed or scared of making a change or using a different approach, be open to something new. It may be a pleasant surprise.

Though it seems like a stretch, using a database to prepare your dispersion modeling analysis rather than the file-based approach we are all used to using for the past 30+ years is similar to having a taste of unknown flavors rather than the safe old dry shoe you were used to.

What is a Database? Let’s Start At the Beginning.

I have been using databases for my dispersion modeling projects for over 20 years, but I am by no means an expert on databases. I just know what works well and I am learning more each day. There are a million things to learn about, but let’s start with simple concepts.

The three most basic structures of a database are:

Tables
Fields
Records

Database Tables

Tables are the building blocks of the database. All data is housed in tables. A table contains data regarding an item or topic that should be similar. For example, if you wanted to store data on source emission rates, the data you would include would be the source name, the air contaminants being emitted, the emission rate of each air contaminant from the source, and the units of measure of the emission rate. Including information on source location or air contaminant molecular weight may not be so appropriate for emission rates.

A database table is made up of columns and rows similar to what you see in a spreadsheet. Remember: a database table is NOT a spreadsheet.

Data Fields

Data fields are the columns of the database table. Fields are the names of the properties of the item described by the database table. As in the example above, the properties of the emission rates would be the source name or ID, air contaminant name, the rate value, and the units of the value.

When constructing a database table, you should consider which properties are really important to store. As in the example above, are source location and air contaminant molecular weight really relevant to emission rates or to something else?

In order to completely describe an item, data fields must be assigned a data type to describe the kind of information the field will contain. For example, a source name or ID could be a text string or a numeric value. Depending upon how the data are used or received (if the data are coming from another source), one data type would be more appropriate than the other.

Data fields expose the power and usefulness of using a database. Remember that computers are not intelligent. They do what they are told. If a database is expecting a value to be a number and the value contains a letter, an error will be raised and not allow it to be stored. If the database were allowed to store any value of any type, you will end up with junk. We have all heard of garbage in, garbage out. Using databases, if used properly, eliminates garbage in.

The list of available data types varies between database platforms, so check which ones are supported by the one you choose. Basically you have numeric types, such as integers, long integers, floating point, single precision, double precision, and even currency. There are string or text types like short text, long text, and variable length strings. There are other data types such as date, time, date and time, binary or Boolean, and more.

Between the data table and data fields, the basic structure of the database is defined. The better you define the data structure, the more likely there will be problems with using the data. However, much time should be spent on considering how and who will be using the data and for what the data are being used. Remember, computers are not intelligent so you need to consider as many use cases as possible when designing a database. If you can capture 80% of the use cases, that is good, but do come up with a means to capture the other 20% to be dealt with by other means.

Data Records

Data records are the rows of the database table. They contain the actual data in the table. The record is made up of the values for each data field in the table. It is a collection of data elements.

The order the records are stored or appear depends on a few things that we will not get into at this point. The database platform keep track of all the records internally.

The important thing to remember when it comes to data records is that a database is NOT a spreadsheet. When values are being entered into a database table, the data is not stored until the entire record has been completed and committed. In a spreadsheet, you can navigate to another row or cell at any time, but with a database, you have to be all in and then you are in. It’s the difference between dating and being married.

Best Practices

Best practices are suggestions, but if you intend to share your database with anyone else, you may want to seriously consider using each one.

Start table names with tbl
Do not use spaces or “-” for field and table names
Use underscore or lowercase/upper case for names, e.g. last_name, LastName
If you are importing data from another source, have all data fields typed as text long enough to accommodate all elements

For the first bullet, it is just good practice to clear identify tables, especially when your databases start getting really big and you have a variety of different structures. Also, this practice has been around as long as I have come to know databases, so it is an unofficial standard.

Explaining the second and third bullets, most database platforms do not like spaces (” “) or dashes (“-“). To accommodate the case when they are used, square brackets ( [ ] ) will need to enclose the table and field names. This gets to be a little messy when you start writing queries and other code referencing database tables and fields.

Regarding the last bullet, I have just learned this from experience and was highly suggested to me by real database experts. If you are importing data from outside data sources, you have no control over the quality of the data being imported. If data types do not match exactly, the import process will fail. If you are importing data programmatically, this can become a big headache quickly. A text field can hold just about any value between two double quotes. You can always convert data type when the data are actually being used.

Terms and Features To Know

People’s behavior fails to amaze me. This can be both positive or negative. One term/concept that database novices have trouble with is null. Null is not a space or the word “null” or a zero (0). All three of those are non-null values. Null is nothing; the absence of anything.

Null Island is an example of how misunderstanding the concept of NULL leads to interesting results in data analysis. Null Island is a fictional island located at Latitude, Longitude coordinates 0, 0. But don’t think using other coordinate standards will magically fix these coordinate issues. This site shows other Null Islands in other coordinate systems.

To me, the most powerful feature/property of databases, which I use synonymously with relational databases, is relationships can be built into the structure. This helps keep the data consistent between tables that have overlapping data. To help enforce stricter consistency rules, referential integrity can employed between tables. This feature prevents bad data getting into the database.

Another feature to know about are indexes. Indexes helps the database find data more quickly when you query the data. We will get more into queries in later articles.

Summary

Most modelers still use the file-based approach of constructing input files and process output files even though using a database is far superior. The most likely reason is the intimation factor when it comes to databases. Plus they are just different. However, once a modeler uses a database for their modeling project, they wonder why they didn’t do it sooner and never go back to the file-based world.

The main structures of a database are:

Tables; they house the data and are the building block of a database
Fields; they comprise the “columns” in a table and list the properties of the data captured in the table
Records; they comprise the “rows” and contain the data elements for each field value. A record is the collection of all the fields values or a complete row.

Fields must be assigned a data type so the database platforms know what kind of value to expect. This helps keep data clean.

Databases are far superior to storing data in a multitude of text files or spreadsheets. The reason is database allow for rules to be constructed regarding what acceptable data should look like and enforces those rules. These rules make analysis and manipulation of the data much easier and quicker.

If you found this article informative, there is more helpful and actionable information for you. Go to http://learn.naviknow.com to see a list of past webinar mini-courses. Every Wednesday (Webinar Wednesday), NaviKnow is offering FREE webinar mini-courses on topics related to air quality dispersion modeling and air quality permitting. If you want to be on our email list, drop me a line at [email protected].

One of the goals of NaviKnow is to create an air quality professional community to share ideas and helpful hints like those covered in this article. So if you found this article helpful, please share with a colleague.

Using Databases with Dispersion Modeling: The Basics