What are databases?

Objectives:

  • Recognize common databases and know industry application
  • Describe what SQL and noSQL mean
  • Describe pros/cons of SQL and noSQL
  • Know where the database lives (remote vs local)

Why use databases?

  • They are more robust than text, CSV or JSON files

What are databases?

  • Systems that manage the storage, querying (inquiring) and retrieving of data where you can write rules that ensure proper data management and verification.

What’s an example of a daily life use of a database?

  • Bank database

What are the most common families of databases?

  • Relational and non-Relational.

Why are databases important in analysis?

  • For pulling data to and from a resource

How to retrieve data?

  • Using SQL

What is the name given to the act of retrieving data?

  • Transaction

Why manage a database?

To ensure…

  • consistency (what if two nodes try to read/edit the file at the same time?)
  • availability (what if a node is not connected to the central bank?)
  • partition tolerance (what if only part of the file is available?)
  • scale (what if too many nodes request data from the file at the same time?)

How do databases ensure that?

By being resilient to these problems…

  • Database must know when a transaction begins
  • when it ends
  • what to do if it never ends
  • what to do if another transaction is requested, while the previous one is still going.

More specifically?

There is a set of properties (ACID) in a database that guarantee reselience

  • Atomicity
  • Consistency
  • Isolation
  • Durability

What are them in detail?

  • Atomicity: Each transaction is “all or nothing”, if one part of the transaction fails, the entire transaction fails
  • Consistency: Any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof.
  • Isolation: Executes transactions serially i.e. one after another
  • Durability: Once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors.

What’s an example of a database that meets those conditions?

  • Relational database

What is a relational database?

  • Collection of tables (a table is similar to single CSV file or Pandas DataFrame) that correspond to concepts

For example?

  • In Twitter, our two main concepts are Users and Tweets. For each of these we would have a table

Why use relational databases instead of a CSV file then?

  • Because each of its tables types are specified by schemas and that makes databases stronger than CSV files

How are tables types specified?

  • Table has schema, set of rules for what each table is and contains (like a Pandas DataFrame header): integers, text, float, etc.

Are columns and rows of the tables specified as well?

  • Yes

How?

  • Each table has a primary key column with a unique value for each row

What are the rest of columns?

  • Foreign key columns

What are they for?

  • Contain values to link the table to the other tables

Example?

The tweets table may have as columns:

  • tweet_id, the primary key tweet identifier, the tweet text, the user id of the member, a foreign key to the users table

How to implement relational databases?

  • Use MySQL or PostgreSQL, because they are open source. Some companies use Oracle or MicrosoftSQL

What are other types of of databases that are not relational?

  1. Key-value stores: Very large and very fast python-like dictionaries
  • Sample topics: Image stores, key-based filesystems, object cache, systems designed to scale
  • Implementations: Cassandra, Redis or memcachedb

2. No SQL: More flexible (Ex: for each user, we may store their metadata, along with a collection of tweets, each of which has its own metadata) databases with SQL capabilities

  • Sample topics: high-variability data, document search, integration hubs, web content management, publishing, and many more
  • Implementations: mongodb and couchdb

3. Timeseries:

  • Databases for handling time series (data indexed by time)

Examples of these databases?

  • Atlas, Druid, InfluxDB, Splunk

Examples of topic in which you should use them?

  • Stock market data, energy load data from a utility company, server metrics, purchase history, website metrics, ads and clicks, sensor data from a wearable device or an internet-of-things sensor, and smartphone sensor data

When do you need the No SQL?

  • Different products may have different properties, a rigid SQL schema makes hard to add new products

4. Graph Databases

  • Key-value databases with the addition of the relationship concept (Databases that store networks data)

What to use them for?

  • Finding communities
  • Finding shortest path between two entities
  • Detecting fraudolent behavior
  • Establishing user identity

What are examples of problems and their respective databases needed?

  1. Database for an application with user profiles
  • Probably relational DB. It could be a ‘noSQL’ database if user profile fields are commonly updating or we want to make it easy to do complex queries on user profile elements.

2. Database for an online store

  • Probably a relational DB – Fast storage querying, standard table structures, transactional capabilities

3. Storing last visit date of a user

  • A fast key-value store would be useful for caching interesting statistics about users (last visit, total visits)

4. Mobile application that allows peer to peer sharing of messages that have short lifetime.

  • Probably a graph database for users and a noSQL or key-value store for messages

5. A hedge fund that needs to record stock market data and run models on them.

  • a time series database

 

 

 

 

 

 

 

In a relational database, for instance, once a group of SQL statements execute, the results need to be stored permanently (even if the database crashes immediately thereafter)

This is the typical model under which relational databases operate, and it fits perfectly our previous example of the bank.

 

 

 

 

 

Leave a comment