In this series we are exploring Microsoft Excel from a data governance perspective. Many organizations use Excel heavily as the primary place where data is entered. This is understandable for younger organizations that are still trying to figure out best practices. However, Excel has many limitations that other database platforms do not, and maturing organizations should re-examine their use of Excel as a means of data entry and storage.
Why is Excel not a great database?
Excel allows too much data redundancy - One of the key principles of good database design is that data should only be maintained in one place, though it can be referenced from many places. In relational databases this is enforced using tables and keys. A table will be the key store of one piece of information, and a unique identifier/key will allow other tables to reference that information without copying all the data. Excel, with its multiple worksheet structure, does not enforce well the concept of keys and storing data in one place only.
What often happens in an excel spreadsheet is that the same data is copied from one worksheet to another, or even from one spreadsheet to another. As the data is moved into different locations it is inevitably changed and then it becomes difficult to know what the "authoritative" source of data is.
Excel does not allow for robust data type enforcement - Another reason why Excel is not a great database is that it is difficult to force constraints onto column data. In a relational database if you set a field constraint the underlying system enforces that constraint rigidly to the extend that it does not permit data to be saved that does not fit the data type of the column. In Excel you can set a data type into a column, but all that means in reality is that Excel tries to massage whatever you type into that column format. In a spreadsheet with thousands of rows it's hard to find the data points that Excel might have interpreted wrongly.
Relational databases also have features that allow the user to specify that a column can only contain certain values. Most often this means that a field cannot under any circumstances be blank. Other times it means that the user should only choose a fixed range of values. Excel does not have easy tools to force these constraints.
Excel makes true collaboration on data tricky - Relational databases have a robust transactional system which allows people to operate only on parts of the database that are not currently being changed and to group operations together into units called transactions. These changes, once committed, can then can be propagated to everyone else working on the database in real time. This ensures that two people don't make contradictory changes to the database.
Collaboration using Microsoft Excel often means copying the spreadsheet to multiple people who each make their own changes (or even submit their own spreadsheets), and these spreadsheets are merged together into a master spreadsheet by another person. Because there is no one "authoritative" version of the spreadsheet it is easy to end up with confusion as people make changes in various places.
Excel is hard to secure - This might be the biggest reason of all why widespread use of Excel can be a data governance nightmare. If an Excel spreadsheet is not password protected (and most of them are not), all it takes to expose the data is for someone to obtain access to the file in question. This happens in several ways - someone can accidentally send the file to the wrong party, someone can physically or through the network get access to the computer where the spreadsheet is stored. Excel spreadsheets often find their way onto USB keys as well and these are misplaced with alarming frequency.
Relational databases such as SQL Server, Postgres, and Oracle are somewhat easier to secure. In these cases there is only one place where the data is stored - presumably on a network or cloud server. This provides two layers of protection around the data - network security and database security.
Backups of relational databases are stored in a format where one must restore the database to a compatible version of the server before one can access the data. This means that it requires a bit of extra sophistication to get access to the data - even if someone gets access to a backup of the data they have to find a way to restore it on a comparable server before they can look at the data. Encrypted backups mitigate this issue even further. This is in contrast to the ease of getting access to data in an Excel spreadsheet - one must simply have a copy of Microsoft Excel on their machine to open a spreadsheet, and almost everyone has Excel.