Trials of an Upper Division Computer Science Student: June 2021

Tuesday, June 29, 2021

CST 334 - Week 1

The First Week

Here we go! Introduction to Operating Systems has officially begun. My mentor told me that this would be the most difficult class in my journey to a bachelor's degree, but Dr. Tao disagrees. Since the latter is the architect of the program, I will take his word for it. Nonetheless, I have been mentally preparing for this class for a while and I am ready to knock it out. I am extremely giddy about learning operating systems.

An Abstraction of Hardware

This week I learned that an operating system is an abstraction of the hardware. The hardware itself has limitations, but an operating system helps overcome these limitations by virtualizing everything (CPU, disk, RAM) and acting as a resource manager, delicately balancing all three.

Virtual memory is probably the best known. The operating system creates an illusion of unlimited memory for an application by using both the RAM and the disk as memory locations (keeping frequently accessed values in RAM and temporarily moving less frequently accessed values to the disk).
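To make the RAM-plus-disk idea concrete, here is a minimal toy sketch in Python. It is not how a real operating system is written: the class name TinyVirtualMemory, the whole-page reads and writes, and the choice of least-recently-used (LRU) eviction are all my own assumptions for illustration. It only shows the general shape of the trick: when RAM is full, an old page is pushed out to the disk and brought back on demand.

```python
from collections import OrderedDict


class TinyVirtualMemory:
    """Toy model (hypothetical): a few RAM frames backed by a 'disk' dict."""

    def __init__(self, ram_frames):
        self.ram_frames = ram_frames
        self.ram = OrderedDict()  # page -> value, least recently used first
        self.disk = {}            # pages that were evicted from RAM
        self.page_faults = 0

    def _load(self, page):
        """Make sure `page` is in RAM, evicting the LRU page if needed."""
        if page in self.ram:
            self.ram.move_to_end(page)  # mark as most recently used
            return
        self.page_faults += 1
        if len(self.ram) >= self.ram_frames:
            # Assumed policy: evict the least recently used page to disk.
            victim, data = self.ram.popitem(last=False)
            self.disk[victim] = data
        # Bring the page back from disk, or start it at 0 if never written.
        self.ram[page] = self.disk.pop(page, 0)

    def write(self, page, value):
        self._load(page)
        self.ram[page] = value

    def read(self, page):
        self._load(page)
        return self.ram[page]
```

With only two RAM frames, writing three pages forces the first one out to the disk, yet reading it later still returns the right value. The application never has to know where the page was living.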

Virtualization of the CPU allows multiple processes to take turns running on the processor, creating an illusion of multiple processors. This is accomplished through process scheduling, which allows processes that are waiting on something else (like input) to wait in a queue while another process takes a turn on the CPU.
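The turn-taking can be sketched with a simple round-robin scheduler. This is only one possible scheduling policy and is my own pick for illustration. The function name round_robin, the quantum parameter, and the process names are hypothetical, and the sketch does not model processes blocking on input. It only shows how a queue lets several processes share one CPU.

```python
from collections import deque


def round_robin(burst_times, quantum):
    """Return the sequence of (process, time_run) slices on a single CPU.

    burst_times: dict mapping a process name to the CPU time it still needs.
    quantum: the longest a process may run before going back in the queue.
    """
    ready = deque(burst_times.items())  # the ready queue, in arrival order
    timeline = []
    while ready:
        name, remaining = ready.popleft()
        ran = min(quantum, remaining)
        timeline.append((name, ran))
        if remaining > ran:
            # Not finished: go to the back of the queue and wait for a turn.
            ready.append((name, remaining - ran))
    return timeline
```

For example, round_robin({"A": 3, "B": 1, "C": 2}, quantum=2) interleaves the three processes rather than running A to completion first, which is what makes each one feel like it has its own processor.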

Finally, virtualization of the disk occurs in the form of a file system. The physical drive only stores raw blocks of data and knows nothing about files or directories, so the operating system must read and write the disk in a certain way (by following the specifications of the file system).

Friday, June 18, 2021

CST 363 - Week 8

Final Thoughts

This class has been challenging but fun. I had never used databases before, yet by week 3 I was expected to build a database from scratch with my team. I understand why some struggled with the course as it really felt "sink or swim" at times. Fortunately I was able to apply my previous experience and knowledge to help me better visualize the material each week. Databases are just the practical side to some of the data structures we've been exposed to, and I am now looking forward to my data science class.

My primary takeaway from this class is simply how to use databases within my applications. I really enjoyed the week 4 assignment (Java web app). It opened my eyes to exactly how I can link SQL to my Java code (although at the moment I do not want to host a web server). That by itself is invaluable to me. I also really enjoyed using JavaScript with MongoDB. Map-reduce, specifically, is an interesting concept that I did not know about and am glad that I was able to experience. Finally, I think that learning about database designs (star schemas, relational schemas, normalization) is extremely valuable because in the future I will possibly have to create my own database from scratch.

Tuesday, June 15, 2021

CST 363 - Week 7

Data Warehouses

Data warehouses are used to store data from multiple sources in a single location. Not all data that a company uses must be stored in a warehouse. Oftentimes the data that is stored in these warehouses is long-term in nature and does not require much updating. Many of the records in these databases are essentially read-only and are used for gathering intelligence about a business, such as its per-quarter earnings, its most popular items, or customer demographics.

Other than for basic backup purposes, it is important to have copies of data in a data warehouse because the warehouses are often followed by new, improved versions of themselves. After the initial version of a data warehouse is created, requirements may be altered which can require the database to be changed or expanded. Fortunately, there is no reason to start again from scratch because the new database can be built upon existing data.

Monday, June 7, 2021

CST 363 - Week 6

SQL vs NoSQL

Many may believe that "NoSQL" is just a modern database experiment, but SQL-style databases were not the first around. In fact, NoSQL (meaning non-SQL-based) databases existed in the 1960s, whereas SQL was not developed until the 1970s (at IBM; Oracle released a commercial SQL database at the end of that decade). However, the popularity of SQL databases cannot be ignored. The truth is that most databases are relational because it is a tried-and-true method of storing, retrieving, and updating large amounts of data. For small, simple data sets, any type of database should suffice. An example of when one may want to use a NoSQL database is when working with JSON documents, which are popular for web development due to their similarity to JavaScript objects and ability to handle high volumes of unstructured web data in real time.

Cassandra vs MongoDB

Two of the most popular NoSQL databases are Cassandra and MongoDB, both open-source systems developed at roughly the same time. One difference between them is how they handle queries. MongoDB uses JSON-type documents, but Cassandra has its own language called CQL (Cassandra Query Language), which is similar to SQL. MongoDB also uses a different architecture, requiring a single "master" node to control all "slave" nodes, while Cassandra has multiple master nodes per cluster, leading to greater uptime and write speeds.

Choosing the right database can be tricky. In the case of MongoDB versus Cassandra, it seems like Cassandra is the better choice for high-volume systems that cannot go offline. It also provides a familiar user experience due to its similarity to SQL. However, MongoDB is supported by more programming languages, has built-in aggregation (Cassandra relies on third-party tools), and is more flexible in that it does not require the use of a specific schema. Ultimately, each project has its own unique needs, and the project requirements will dictate which database to use.

Tuesday, June 1, 2021

CST 363 - Week 5

Slow Indexes

This week I was exposed to an awesome website (Use the Index, Luke!) that helped me understand how databases work. One key feature of databases is indexes. Indexes in a database are just like indexes at the back of a book: They point to the place you need to go. Databases use something called a B+ tree to implement indexes. A B+ tree is a combination of a balanced tree with a linked list. While the traversal time from the root of the tree to the leaves is logarithmic, the traversal time from leaf to leaf is linear. If a select statement requests a range of rows, for example, the index will allow the database to quickly find the start of the range, but it will still have to travel to the end of the range one leaf at a time. There is also the time needed to actually fetch the data from the table.

To summarize, there are actually three steps to an index lookup: tree traversal, leaf node following, and data fetching. The only step that is guaranteed to be quick is the first one (tree traversal). The other two can take a long time if there are many records to go through and retrieve. Therefore, a slow lookup does not mean that the index itself is broken. The "slow index" that needs to be fixed is a myth, and taking drastic action (such as rebuilding the index) is not necessary.
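The three steps above can be sketched in a few lines of Python. This is a simplified stand-in, not a real B+ tree: a sorted list searched with bisect plays the role of the tree traversal, and the function name range_lookup, the row IDs, and the sample table are all hypothetical. It does preserve the cost profile described above: one logarithmic search, then a linear walk with one table fetch per matching entry.

```python
import bisect


def range_lookup(index_keys, index_rowids, table, low, high):
    """Return the table rows whose indexed key falls in [low, high].

    index_keys: sorted list of keys (stands in for the B+ tree leaf level).
    index_rowids: row IDs, parallel to index_keys.
    table: dict mapping a row ID to the row data.
    """
    # Step 1: tree traversal. Binary search stands in for descending
    # the tree from root to leaf, so this step is logarithmic.
    pos = bisect.bisect_left(index_keys, low)

    rows = []
    # Step 2: leaf node following. Walk entry by entry until the end of
    # the range, so this step is linear in the number of matches.
    while pos < len(index_keys) and index_keys[pos] <= high:
        # Step 3: data fetching. One table access per matching entry.
        rows.append(table[index_rowids[pos]])
        pos += 1
    return rows
```

A wide range means many iterations of steps 2 and 3 no matter how healthy the index is, which is why rebuilding the index would not speed up the query.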
