CS4: INTRODUCTION TO DATABASES

This Unit is about the Introduction to Databases.

Google defines database as “a structured set of data held in a computer, especially one that is accessible in various ways.” At its most basic, a database is just a way of storing and organizing information. Ideally it is organized in such a way that it can be easily accessed, managed, and updated.

I like metaphors, so this simple definition of a database for me is like a toolbox. You’ve got lots of screws, nails, bits, a couple different hammers… A toolbox is a storage system that allows you to easily organize and access all of these things. Whenever you need a tool, you go to the toolbox. Maybe you have labels on the drawers — those will help you find, say, a cordless power drill. But now you need the right battery for the drill. You look in your “battery” drawer, but how do you find the one that fits this particular drill? You can run through all of your batteries using trial and error, but that seems inefficient. You think, ‘Maybe I should store my batteries with their respective drills, link them in some way.’ That might be a viable solution. But if you need all of your batteries (because you’re setting up a nice new charging station maybe?), will you have to access each of your drills to get them? Maybe one battery fits multiple drills? Also, toolboxes are great for storing disjointed tools and pieces, but you wouldn’t want to have to take your car apart and store every piece separately whenever you park it in the garage. In that case, you would want to store your car as a single entry in the database, and access its pieces through it.

At least I can easily find the alternator this way, right?

This example is contrived, but reveals some issues you’ll have to consider when choosing a database or how to store your data within it.

Let’s learn the lingo.

If you start directionlessly googling “databases”, you’ll soon realize there are several different types and lots of terminology surrounding them. So let’s try and clear up any potential language barriers.

While I’m sure someone has written books on each of these, I’ll try to keep my definitions relatively simple. These were all terms that I came across while doing this research that I thought could use some quick explanation.

  • Query
    – A query is a single action taken on a database, a request presented in a predefined format. This is typically one of SELECT, INSERT, UPDATE, or DELETE.
    – We also use ‘query’ to describe a request from a user for information from a database. “Hey toolbox, could you get me the names of all the tools in the ‘wrenches’ drawer?” might look something like SELECT ToolName FROM Wrenches.
  • Transaction
    A transaction is a sequence of operations (queries) that make up a single unit of work performed against a database. For example, Rob paying George $20 is a transaction that consists of two UPDATE operations; reducing Rob’s balance by $20 and increasing George’s.
  • ACID: Atomicity, Consistency, Isolation, Durability
    In most popular databases, a transaction is only qualified as a transaction if it exhibits the four “ACID” properties:
    – Atomicity: Each transaction is a unique, atomic unit of work. If one operation fails, data remains unchanged. It’s all or nothing. Rob will never lose $20 without George being paid.
    – Consistency: All data written to the database is subject to any rules defined. When completed, a transaction must leave all data in a consistent state.
    – Isolation: Changes made in a transaction are not visible to other transactions until they are complete.
    – Durability: Changes completed by a transaction are stored and available in the database, even in the event of a system failure.
  • Schema
    – 
    A database schema is the skeleton or structure of a database; a logical blueprint of how the database is constructed and how things relate to each other (with tables/relations, indices, etc).
    – Some schemas are static (defined before a program is written), and some are dynamic (defined by the program or data itself).
  • DBMS: database management system
    Wikipedia has a great summary: “A database management system is a software application that interacts with the user, other applications, and the database itself to capture and analyze data. A general-purpose DBMS is designed to allow the definition, creation, querying, update, and administration of databases.” MySQL, PostgreSQL, Oracle — these are database management systems.
  • Middleware
    Database-oriented middleware is “all the software that connects some application to some database.” Some definitions include the DBMS under this category. Middleware might also facilitate access to a DBMS via a web server for example, without having to worry about database-specific characteristics.
  • Distributed vs Centralized Databases
    – As their names imply, a centralized database has only one database file, kept at a single location on a given network; a distributed database is composed of multiple database files stored in multiple physical locations, all controlled by a central DBMS.
    – Distributed databases are more complex, and require additional work to keep the data stored up-to-date and to avoid redundancy. However, they provide parallelization (which balances the load between several servers), preventing bottlenecking when a large number of requests come through.
    – Centralized databases make data integrity easier to maintain; once data is stored, outdated or inaccurate data (stale data) is no longer available in other places. However, it may be more difficult to retrieve lost or overwritten data in a centralized database, since it lacks easily accessible copies by nature.
  • Scalability
    Scalability is the capability of a database to handle a growing amount of data. There are two types of scalability:
    – Vertical scalability is simply adding more capacity to a single machine. Virtually every database is vertically scalable.
    – Horizontal scalability refers to adding capacity by adding more machines. The DBMS needs to be able to partition, manage, and maintain data across all machines.
Vertical vs Horizontal Scaling. Little buckets don’t need as much brawn to carry, but they do require better coordination.

Relational (SQL) Databases

  • “relational”
    – I highly recommend this article, which explains, “The word ‘relational’ in a ‘relational database’ has nothing to do with relationships. It’s about relations from relational algebra.
    – In a relational database, each relation is a set of tuples. Each tuple is a list of attributes, which represents a single item in the database. Each tuple (“row”) in a relation (“table”) shares the same attributes (“columns”). Each attribute has a well-defined data type (int, string, etc), defined ahead of time — schema in a relational database is static.
    – Examples include: Oracle, MySQL, SQLite, PostgreSQL
Thanks, Wikipedia.
  • SQL: Structured Query Language
    SQL is a programming language based on relational algebra used to manipulate and retrieve data in a relational database. note: In the bullet above, I’m intentionally separating the relational database terminology (relation, tuple, attribute) from the SQL terminology (table, row, column) in order to provide some clarity and accuracy.
  • Illustrative Example:
    We could store all the data behind a blog in a relational database. One relation will represent our blog posts, each of which will have ‘post_title’ and ‘post_content’ attributes as well as a unique ‘post_id’ (a primary key). Another relation might store all of the comments on a blog. Each item here will also have attributes like ‘comment_author’, ‘comment_content’, and ‘comment_id’ (again, a primary key), as well as its own ‘post_id.’ This attribute is a foreign key, and tells us which blog post each comment “relates” to. When we want to open a webpage for post #2 for example, we might say to the database: “select everything from the ‘posts’ table where the ID of the post is 2,” and then “select everything from the comments table where the ‘post_id’ is 2.”
  • JOIN operations
    – A JOIN operation combines rows from multiple tables in one query. There are a few different types of joins and reasons for using them, but this page provides good explanations and examples.
    – These operations are typically only used with relational databases and so are mentioned often when characterizing “relational” functionality.
  • Normalization and Denormalization
    – Normalization
     is the process of organizing the relations and attributes of a relational database in a way that reduces redundancy and improves data integrity (accurate, consistent, up-to-date data). Data might be arranged based on dependencies between attributes, for example — we might prevent repeating information by using JOIN operations.
    – Denormalization then, is the process of adding redundant data in order to speed up complex queries. We might include the data from one table in another to eliminate the second table and reduce the number of JOIN operations.
  • ORM: Object-Relational Mapping
    ORM is a technique for translating the logical representation of objects (as in object-oriented programming) into a more atomized form that is capable of being stored in a relational database (and back again when they are retrieved). I won’t go into more detail here, but it’s good to know it exists.

Non-Relational (NoSQL) Databases

  • “non-relational”
    At it’s simplest, a non-relational database is one that doesn’t use the relational model; no relations (tables) with tuples (rows) and attributes (columns). This title covers a pretty wide range of models, typically grouped into four categories: key-value stores, graph stores, column stores, and document stores.

No tables.
  • Illustrative Example:
    – When we set up our blog posts and comments in a relational database, it worked in the same way as the drawers of our toolbox. But, much like our drill and battery example, does it make sense to always store our blog posts in one place, and comments in another? They’re clearly related, and it’s probably rare that we’d want to look at a post’s comments without also wanting the post itself. If we used a non-relational database, opening a webpage for post #2 might look something like this: “select post #2 and everything related to it.” In this case, that would mean a ‘title’ and ‘content’, as well as a list of comments. And since we’re no longer constrained by rows always sharing the same columns, we can associate any arbitrary data with any blog posts as well — maybe some have tags, others images, or as your blog grows, you’d like some of your new posts to link to live Twitter streams. With the non-relational model, we don’t need to know ahead of time that all of our blog posts have the same attributes, and as we add attributes to newer items, we are not required to also add that “column” to all previous items as well.
    – This model also works well for the car example from earlier in this post. If you have three cars in your garage, it doesn’t make sense to store all of their tires together, seats together, radiators together… Instead, you store an entire car and everything related to it in its own “document.”
    – However, there may be a downside to this. If you wanted to know how many seats (or comments, or batteries) you have total, you may have to go through every car and count each seat individually. There are ways around this of course, but it’s less trivial than just opening up the “seats” drawer and checking your total, especially on much larger scales.
  • NoSQL
    “NoSQL” originally referred to “non-SQL” or “non-relational” when describing a database. Sometimes “NoSQL” is also meant to mean “Not only SQL”, to emphasize that they don’t prohibit SQL or SQL-like query languages; they just avoid functionality like relation/table schemas and JOIN operations.
  • Key-Value Store
    – Key-value stores don’t use the pre-defined structure of relational databases, but instead treat all of their data as a single collection of items. For example, a screwdriver in our toolbox might have attributes like “drive_type”, “length”, and “size”, but a hammer may only have one attribute: “size”. Instead of storing (often empty) “drive_type” and “length” fields for every item in your toolbox, a “hammer_01” key will return only the information relevant to it.
    – Success with this model lies in its simplicity. Like a map or a dictionary, each key-value pair defines a link between some unique “key” (like a name, ID, or URL) and its “value” (an image, a file, a string, int, list, etc). There are no fields, so the entire value must be updated if changes are made. Key-value stores are generally fast, scalable, and flexible.
    – Examples include: Dynamo, MemcacheDB, Redis
  • Graph Store
    – Graph stores are a little more complicated.Using graph structures, this type of database is made for dealing with interconnected data — think social media connections, a family tree, or a food chain. Items in the database are represented by “nodes”, and “edges” directly represent the relationships between them. Both nodes and edges can store additional “properties”: id, name, type, etc.
Something like this.

– The strength of a graph database is in traversing the connections between items, but their scalability is limited.
– Examples include: Allegro, OrientDB, Virtuoso

  • Column Store
    – Row-oriented databases describe single items as rows, and store all the data in a particular table’s rows together: ‘hammer_01’, ‘medium’, ‘blue’; ‘hammer_02’, ‘large’, ‘yellow’. A column store, on the other hand, generally stores all the values of a particular column together: ‘hammer_01’, ‘hammer_02’; ‘medium’, ‘large’; ‘blue’, ‘yellow’.
    – This can definitely get confusing, but the two map data very differently. In a row-oriented system, the primary key is the row ID, mapped to its data. In the column-oriented system, the primary key is the data, mapping back to row IDs. This allows for some very quick aggregations like totals and averages.
    – Examples include: Accumulo, Cassandra, HBase
  • Document Store
    – Document stores treat all information for a given item in the database as a single instance in the database (each of which can have its own structure and attributes, like other non-relational databases). These “documents” can generally be thought of as sets of key-value pairs: {ToolName: “hammer_01”, Size: “medium”, Color: “blue”}
    – Documents can be independent units, which makes performance and horizontal scalability better, and unstructured data can be stored easily.
    – Examples include: Apache CouchDB, MongoDB, Azure DocumentDB.
  • Object or Object-Oriented Database
    Not as common as other non-relational databases, an object or object-oriented database is ones in which data is represented in the form of “objects” (with attributes and methods) as used in object-oriented programming. This type might be used in place of a relational database and ORM, and may make sense when the data is complex or there are complex many-to-many relationships involved. Beware its language dependence and difficulty with ad-hoc queries though.

FEATURES OF DATABASES

Here’s some important features that we think all databases should have:

  1. Offering text standardization function to help with (1) data cleaning, (2) reducing volume of text information, (3) merging data from different sources or having different character sets
  2. Ability to categorize information (text in particular, and tagged data more generally), using built-in or ad hoc taxonomies (with a customized number of categories and subcategories), together with clustering algorithms. A data record can be assigned to multiple categories.
  3. Ability to efficiently store images, books, records with high variance in length, possibly though an auxiliary file management system and data compression algorithms, accessible from the database.
  4. Ability to process data remotely on your local machine, or externally, especially computer-intensive computations performed on a small number of records. Also, optimizing use of cache systems for faster retrieval.
  5. Offer SQL to NoSQL translation and SQL code beautifier.
  6. Offer great visuals (integration with Tableau?) and useful, efficient dashboards (integration with Birt?)
  7. API and Web/browser access: database calls can be made with HTTPS requests, with parameters (argument and value) embedded in the query string attached to the URL. Also, allow recovery of extracted / processed data on a server, via HTTPS calls, automatically. This is a solution to allow for machine-to-machine communications.
  8. DBA tools available to sophisticated users, such as fine-tuning query optimization algorithms, allowing hash joins, switching from row to column database when needed (to optimize data retrieval in specific cases).
  9. Real time transaction processing and built in functions such as computations of “time elapsed since last 5 transactions” at various levels of aggregation.
  10. Ability to automatically erase old records and keep only historical summaries (aggregates) for data older than (say) 12 months.
  11. Security (TBD)

Note that some database systems are very different from traditional DB architecture. For instance, I created a web app to find keywords related to keywords specified by a user. This system (it’s more like a student project than a real app, though you can test it here) has the following features:

  • It’s based on a file management system (no actual database)
  • It is a table with several million entries, each entry being a keyword and a related keyword, plus metrics that measure the quality of the match (how strong the relationship between the two keywords is), as well as frequencies attached to these two keywords, and when they are jointly found. The function to measure the quality of the match can be customized by the user.
  • The table is split into 27 x 27 small text files. For instance, the file cu_keywords.txt contains all enties for keywords starting with letter cu (it’s a bit more complicated than that, but this shows you how I replicated the indexing capabilities of modern databases).
  • It’s running on a shared server, at it peaks hundreds of users were using it and the time to get an answer (retrive keyword related data) for each query was less than 0.5 second – most of that time spent on transferring data over the Internet, with very little time used to extract the data on the server.
  • It offers API and web access (and indeed, no other types of access)

IMPORTANCE OF DATA MANAGEMENT SYSTEMS

A database management system is important because it manages data efficiently and allows users to perform multiple tasks with ease. A database management system stores, organizes and manages a large amount of information within a single software application. Use of this system increases efficiency of business operations and reduces overall costs.

Database management systems are important to businesses and organizations because they provide a highly efficient method for handling multiple types of data. Some of the data that are easily managed with this type of system include: employee records, student information, payroll, accounting, project management, inventory and library books. These systems are built to be extremely versatile.

Without database management, tasks have to be done manually and take more time. Data can be categorized and structured to suit the needs of the company or organization. Data is entered into the system and accessed on a routine basis by assigned users. Each user may have an assigned password to gain access to their part of the system. Multiple users can use the system at the same time in different ways.

For example, a company’s human resources department uses the database to manage employee records, distribute legal information to employees and create updated hiring reports. A manufacturer might use this type of system to keep track of production, inventory and distribution. In both scenarios, the database management system operates to create a smoother and more organized working environment.

A simple database has a single table with rows for the data and columns that define the data elements. For an address book, the table columns define data elements such as name, address, city, state and phone number, while a table row, or record, contains data for each person in the book. The query language provides a way to find specific types of data in each record and return results that match the criteria. These results display in a form that uses the defined data elements but only shows records that meet the criteria. These three components make up almost every type of database.

Relational databases use multiple tables and define relationships between them using a schema in addition to data elements. Records and data elements from each table merge, based on the query, and display in the form. Routinely used queries often become reports. A report uses the same query but reports on changes in data over time.

There are five major components in a database environment: data, hardware, software, people and procedures. The data is a collection of facts, typically related. The hardware is the physical devices in the database environment. Operating systems, database management systems and applications make up the software. Examples of people in the database environment are the system administrator, programmers and end users. Procedures are the instructions and rules for the database.

Database Management System is quite useful compared to the file based management system. However, it does have some disadvantages. Some of those are as follows:

More Costly

Creating and managing a database is quite costly. High cost software and hardware is required for the database. Also highly trained staff is required to handle the database and it also needs continuous maintenance. All of these ends up making a database quite a costly venture.

High Complexity

A Database Management System is quite complex as it involves creating, modifying and editing a database. Consequently, the people who handle a database or work with it need to be quite skilled or valuable data can be lost.

Database handling staff required

As discussed in the previous point, database and DBMS are quite complex. Hence, skilled personnel are required to handle the database so that it works in optimum condition. This is a costly venture as these professionals need to be very well paid.

Database Failure

All the relevant data for any company is stored in a database. So it is imperative that the database works in optimal condition and there are no failures. A database failure can be catastrophic and can lead to loss or corruption of very important data.

High Hardware Cost

A database contains vast amount of data. So a large disk storage is required to store all this data. Sometimes extra storage may even be needed. All this increases hardware costs by a lot and makes a database quite expensive.

Huge Size

A database contains a large amount of data, especially for bigger organisations. This data may even increase as more data is updated into the database. All of these leads to a large size of the database.

The bigger the database is, it is more difficult to handle and maintain. It is also more complex to ensure data consistency and user authentication across big databases.

Up gradation Costs

Often new functionalities are added to the database.This leads to database up gradations. All of these up gradations cost a lot of money. Moreover it is also quite expensive to train the database managers and users to handle these new up gradations.

Cost of Data Conversion

If the database is changed or modified in some manner, all the data needs to be converted to the new form. This cost may even exceed the database creation and management costs sometimes. This is the reason most organisations prefer to work on their old databases rather than upgrade to new ones.

THIS LINK EXPLAINS MORE

https://smallbusiness.chron.com/disadvantages-business-databases-33951.html

THIS VIDEO EXPLAINS MORE ABOUT DATABASES

SEE ALLAdd a note
YOU
Add your Comment
 

DOWNLOAD YAAKA DN APP









X