Objectives:
- Recognize common databases and know industry application
- Describe what SQL and noSQL mean
- Describe pros/cons of SQL and noSQL
- Know where the database lives (remote vs local)
Why use databases?
- They are more robust than text, CSV or JSON files
What are databases?
- Systems that manage the storage, querying (inquiring) and retrieving of data where you can write rules that ensure proper data management and verification.
What’s an example of a daily life use of a database?
- Bank database
What are the most common families of databases?
- Relational and non-Relational.
Why are databases important in analysis?
- For pulling data to and from a resource
How to retrieve data?
- Using SQL
What is the name given to the act of retrieving data?
- Transaction
Why manage a database?
To ensure…
- consistency (what if two nodes try to read/edit the file at the same time?)
- availability (what if a node is not connected to the central bank?)
- partition tolerance (what if only part of the file is available?)
- scale (what if too many nodes request data from the file at the same time?)
How do databases ensure that?
By being resilient to these problems…
- Database must know when a transaction begins
- when it ends
- what to do if it never ends
- what to do if another transaction is requested, while the previous one is still going.
More specifically?
There is a set of properties (ACID) in a database that guarantee reselience
- Atomicity
- Consistency
- Isolation
- Durability
What are them in detail?
- Atomicity: Each transaction is “all or nothing”, if one part of the transaction fails, the entire transaction fails
- Consistency: Any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof.
- Isolation: Executes transactions serially i.e. one after another
- Durability: Once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors.
What’s an example of a database that meets those conditions?
- Relational database
What is a relational database?
- Collection of tables (a table is similar to single CSV file or Pandas DataFrame) that correspond to concepts
For example?
- In Twitter, our two main concepts are Users and Tweets. For each of these we would have a table
Why use relational databases instead of a CSV file then?
- Because each of its tables types are specified by schemas and that makes databases stronger than CSV files
How are tables types specified?
- Table has schema, set of rules for what each table is and contains (like a Pandas DataFrame header): integers, text, float, etc.
Are columns and rows of the tables specified as well?
- Yes
How?
- Each table has a primary key column with a unique value for each row
What are the rest of columns?
- Foreign key columns
What are they for?
- Contain values to link the table to the other tables
Example?
The tweets table may have as columns:
- tweet_id, the primary key tweet identifier, the tweet text, the user id of the member, a foreign key to the users table
How to implement relational databases?
- Use MySQL or PostgreSQL, because they are open source. Some companies use Oracle or MicrosoftSQL
What are other types of of databases that are not relational?
- Key-value stores: Very large and very fast python-like dictionaries
- Sample topics: Image stores, key-based filesystems, object cache, systems designed to scale
- Implementations: Cassandra, Redis or
memcachedb
2. No SQL: More flexible (Ex: for each user, we may store their metadata, along with a collection of tweets, each of which has its own metadata) databases with SQL capabilities
- Sample topics: high-variability data, document search, integration hubs, web content management, publishing, and many more
- Implementations:
mongodb
andcouchdb
3. Timeseries:
- Databases for handling time series (data indexed by time)
Examples of these databases?
- Atlas, Druid, InfluxDB, Splunk
Examples of topic in which you should use them?
- Stock market data, energy load data from a utility company, server metrics, purchase history, website metrics, ads and clicks, sensor data from a wearable device or an internet-of-things sensor, and smartphone sensor data
When do you need the No SQL?
- Different products may have different properties, a rigid SQL schema makes hard to add new products
4. Graph Databases
- Key-value databases with the addition of the relationship concept (Databases that store networks data)
What to use them for?
- Finding communities
- Finding shortest path between two entities
- Detecting fraudolent behavior
- Establishing user identity
What are examples of problems and their respective databases needed?
- Database for an application with user profiles
- Probably relational DB. It could be a ‘noSQL’ database if user profile fields are commonly updating or we want to make it easy to do complex queries on user profile elements.
2. Database for an online store
- Probably a relational DB – Fast storage querying, standard table structures, transactional capabilities
3. Storing last visit date of a user
- A fast key-value store would be useful for caching interesting statistics about users (last visit, total visits)
4. Mobile application that allows peer to peer sharing of messages that have short lifetime.
- Probably a graph database for users and a noSQL or key-value store for messages
5. A hedge fund that needs to record stock market data and run models on them.
- a time series database
In a relational database, for instance, once a group of SQL statements execute, the results need to be stored permanently (even if the database crashes immediately thereafter)
This is the typical model under which relational databases operate, and it fits perfectly our previous example of the bank.