Peer to Peer Knowledge Base Article:
Literature Review
The Foundational Role of SQL in Data Extraction SQL is the fundamental language for interacting with relational databases. It is used to store, manipulate and retrieve data. A relational database organizes data into tables, which are linked based on predefined relationships. SQL allows analysts to communicate with these databases to extract exactly the data they need. Without SQL, accessing data from large, structured databases would be incredibly difficult and slow. As Groff, Weinberg, & Oppel (2010) state, SQL’s power lies in its declarative nature; you specify what data you want, not how to get it and the database engine figures out the most efficient way to execute the request. Basic SQL involves commands like SELECT, FROM and WHERE. These are used to pull data from a single table based on simple conditions. For example, SELECT * FROM customers WHERE country = ‘USA’; will return all customers from the USA. While this is useful, real-world data analysis is rarely this simple. Data is almost always spread across multiple related tables. This is where the need for advanced SQL begins. The ability to efficiently combine data from different tables and perform complex calculations directly within the database is what sets proficient analysts apart. Efficient data extraction is the first and most critical step in the entire analytical workflow. If this step is slow or yields incorrect data, all subsequent steps are compromised.
Key Advanced SQL Techniques for Efficiency and Accuracy Advanced SQL techniques move beyond simple data selection to enable powerful data transformation and aggregation directly within the database. This improves efficiency by reducing the data volume before it is loaded into analysis tools like Excel. It enhances accuracy by ensuring complex logic is applied consistently in one place. JOIN Operations: The JOIN clause is used to combine rows from two or more tables based on a related column. Understanding different types of joins—INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL OUTER JOIN—is crucial. An INNER JOIN returns only the matching rows from both tables, while a LEFT JOIN returns all rows from the left table and the matched rows from the right table. Using the wrong type of join is a common source of data inaccuracy, as it can inadvertently exclude or include incorrect records (Beaulieu, 2009). For efficiency, properly indexed join keys are essential for fast query performance. Subqueries and Common Table Expressions (CTEs): A subquery is a query nested inside another query. It can be used to filter data or perform calculations in a step-by-step manner. However, complex subqueries can be difficult to read and maintain. CTEs provide a more readable and efficient alternative. A CTE creates a temporary result set that you can reference within your main query. This makes complex queries easier to write, debug and understand, thereby reducing errors and improving maintainability (Morgado, 2017). For example, you can use a CTE to first calculate a complex aggregation and then join that result to another table in the main query. Window Functions: Window functions are among the most powerful features in advanced SQL. They perform a calculation across a set of table rows that are somehow related to the current row. Unlike regular aggregate functions, window functions do not cause rows to become grouped into a single output row. This allows you to see both the detailed data and the aggregated value side-by-side. Common window functions include ROW_NUMBER(), RANK(), LAG(), LEAD() and aggregation with OVER(). For instance, you can use LAG() to compare a sales value from the current month to the previous month without having to perform a self-join, which is a much less efficient operation (Tao, 2021). This directly enhances both efficiency (faster calculation) and accuracy (less complex, error-prone logic). Aggregate Functions with GROUP BY and HAVING: While GROUP BY is a fundamental concept, its advanced use with filtering conditions using HAVING is critical. The WHERE clause filters rows before aggregation, while the HAVING clause filters groups after aggregation. Using HAVING correctly is vital for accuracy when you need to filter based on the result of an aggregate function, like finding only those departments where the average salary is above a certain threshold.
The Foundational Role of Excel in Data Analysis and Reporting Microsoft Excel is arguably the most ubiquitous data analysis tool in the world. Its accessibility and intuitive grid interface make it the starting point for millions of analysts. Excel is used for tasks ranging from simple lists and calculations to complex financial models and dashboards. Its strength lies in its flexibility for ad-hoc analysis, visualization and reporting. As Winston (2016) notes, Excel’s power is not just in its functions, but in its ability to let users “play” with data, fostering exploration and discovery. Basic Excel proficiency involves using simple formulas like SUM and AVERAGE, creating basic charts and using filters. Many users stop at this level. However, this limits their analytical capabilities and forces them into manual, repetitive work. When datasets become large or complex, basic Excel methods break down. They become slow, prone to crashing and the risk of formula errors increases dramatically. This is where advanced Excel techniques become essential. They transform Excel from a simple calculator into a powerful data analysis and business intelligence tool, capable of handling much more sophisticated tasks efficiently and reliably.
The world’s first electronic spreadsheet, VisiCalc, was created by Dan Bricklin and Bob Frankston in 1979. The spreadsheet was widely regarded as the ‘killer app’ for personal computers. It managed to hit a sweet spot between usability and functionality—millions of users with no formal training in programming were suddenly enabled to create custom applications of their own. Today, the spreadsheet remains as popular as ever. Indeed, in 2001 there were an estimated 45 million end-users of spreadsheets or databases in the United States alone, representing 60% of the American workforce (Scaffidi et al., 2005), and this number is rising. Spreadsheets do one thing and they do it well, which is to perform mathematical computations. While spreadsheets do have built-in Boolean functions and conditional functions, it is clear that they were not designed from the ground up to support logical reasoning. In modern commercial spreadsheets, logical functions have an inelegant syntax and inappropriate editing tools. Worse, the logical reasoning allowed is quite limited—there are no n-ary predicates, unification routines, negation-as-failure capabilities, and the like. By extending a spreadsheet with proper logical reasoning capabilities, we allow them to be used for many additional applications, such as applying business rules, performing symbolic what-if analyses, and transforming data from one representation to another. By incorporating further enhancements such as generalizing formulas to constraints, spreadsheets can be further used for applications such as data entry and validation, enterprise management, and constraint-solving. In the sequel, we describe the work that has been done on logical spreadsheets up until today, and attempt to categorize the different logical spreadsheet systems that have been produced.
IF Statement (and other IFs)
IF
An IF statement is powerful; it helps you do everything from fix typos to group data. In the example below, I create a “Mac?” column that is True when the cat is Mac.

The formula for this is:
=if(C2=”Mac”, True, False)
If we imagine we had the same cat table in Snowflake, the equivalent formula would be:
SELECT
*,
IFF(cat_name = ‘Mac’, True, False) AS is_mac
FROM cat_table;
In a database, double quotes (“) also denotes a database object, such as a column name. A literal string (Mac) is put in single quotes. Other than that, the function is quite similar.
Snowflake reference for IFF
IFS
The complexity comes in once we there are multiple IF statements to evaluate. To determine whether the cat is one of the twins, I have to allow that the cat could be either Mac or Cheese (they actually had another litter mate who we could not adopt). I also want to label them Twin1 (Mac) and Twin2 (Cheese). In Excel, the function is:
=if(C2=”Mac”, “Twin1″, if(C2=”Cheese”, “Twin2”, “Not Twin”))

Notice how in order to have more than two outcomes (True/False versus Twin1/Twin2/Not Twin), we need to embed another IF statement into the false_value. This gets unreadable really quickly. Luckily, in SQL you can do what’s called a case statement or a case when statement. A case statement for this would be:
SELECT
*,
CASE
WHEN cat_name = ‘Mac’
THEN ‘Twin1’
WHEN cat_name = ‘Cheese’
THEN ‘Twin2’
ELSE ‘Not Twin’
END AS twin_number
FROM cat_table;
Maybe it is the spacing or the lack of parenthesis, but I find this more readable and intuitive. Notice the single quotes again. The ELSE clause is optional; without explicitly giving a value to those that do not fall in a category, they will be NULL. I often write “ELSE NULL END” to be explicit, but this is technically redundant. Similar to Excel, the resulting value of the CASE WHEN is whichever evaluates as true first. In our example, a cat cannot be named Mac and Cheese, so let’s say we had this statement:
SELECT
*,
CASE
WHEN cat_name = ‘Mac’
THEN ‘Twin1’
WHEN cat_color = ‘Black’
THEN ‘Twin2’
ELSE ‘Not Twin’
END AS twin_number
FROM cat_table;
Although Mac is black, he will have twin_number of Twin1 since cat_name = ‘Mac’ evaluates to true first.
Note: I tried to use the Excel function IFS but I couldn’t figure out how to include a default value (“Not Twin”):
=ifs(C2=”Mac”, “Twin1″, F2=”Black”, “Twin2”)
SUMIF
Now that we know who the twins are, we can sum only the twins’ weights. In Excel, we would do: =sumif(I2:I4,True,G2:G4)

The twins total weight is almost 20 pounds!
SELECT
CASE
WHEN cat_name = ‘Mac’ OR cat_name = ‘Cheese’
THEN TRUE
ELSE FALSE
END AS is_twin,
SUM(weight) AS total_weight
FROM cat_table
GROUP BY is_twin;
This requires a bit of explanation. First we are using a case statement to label whether a cat is a twin or not. Then we sum the weights of cats grouped by is_twin. This will give us:
| IS_TWIN | TOTAL_WEIGHT |
| TRUE | 19.8 |
| FALSE | 7 |
The results are not formatted the same as a SUMIF since we also get the non-twins weights, but we get the data we need: 19.8 pounds of mini demon panthers sent here to ruin my couch.
This query can also be written as
SELECT
SUM(weight) AS total_weight
FROM cat_table
WHERE cat_name = ‘Mac’ OR cat_name = ‘Cheese’
;
This will simply give you 19.8.
COUNTIF
If we want to count the number of female and male kittens we have, in Excel, we would do:
=countif(M1:M4, “Female”) and
=countif(M1:M4, “Male”)

Similar to the total_weight calculation above, this can be accomplished in SQL by:
SELECT
gender,
COUNT(*) AS count
FROM cat_table
GROUP BY gender
This will give us
| GENDER | COUNT |
| male | 1 |
| female | 2 |
Note that Snowflake actually does have a count if: https://docs.snowflake.com/en/sql-reference/functions/count_if.html
SELECT
COUNTIF(gender = ‘Female’) AS female_count
FROM cat_table
I never use that function but maybe I’ll remember to now.
SUBSTITUTE
Let’s say we got user entered info for our cat named and they appeared like this:

We would want to clean it up by removing the _, essentially replacing it with a blank string. We would do a substitute like this:
=substitute(B2, “_”, “”)
In Snowflake, we do a replace function. With replace, if we want to eliminate, we can enter only two arguments if we want to replace the substring with the empty string:
SELECT
REPLACE(user_entered_cat_name, ‘_’) AS cat_name
FROM cat_table
Similar to in Excel, if you want to replace multiple strings, you have to embed the REPLACE function:
SELECT
REPLACE(REPLACE(user_entered_cat_name, ‘_’), ‘!’) AS cat_name
FROM cat_table
It can end up being a really nasty line of code.
CONCATENATE
To conCATenate in Excel (ha, get it), you use &

In Snowflake, you use || or the CONCAT function
No matter which one you do, remember the spaces:
SELECT
cat_name || ‘ ‘ || cat_id,
CONCAT(cat_name, ‘ ‘, cat_id)
FROM cat_table
VLOOKUP / HLOOKUP
I know often in Excel, in order to combine data from multiple tabs, people do a VLOOKUP or an HLOOKUP. In databases, a VLOOKUP is basically a JOIN. An HLOOKUP is not really a thing. I’ll explain why.
VLOOKUP
Let’s say someone gave us the cats cuteness levels

To find the cats cuteness level, we do a VLOOKUP on Cat ID
=VLOOKUP(A2,Cuteness!A:B,2,True)

Perfect, the cats are all top level cuteness. Notice that the vlookup is “on” a certain field. This is the field that the two datasets have in common; more specifically, the field that has to be equal in order for the data to be relevant for the row. To do this in databases, we join:
SELECT
*
FROM cat_info
LEFT JOIN cat_cuteness
ON cat_info.cat_id = cat_cuteness.cat_id
Notice that we are joining “on” cat_id. In SQL, you can join on multiple fields (or none! don’t do that!). It’s most common to join on an ID field though.
Notice also that I used a LEFT join. Unlike in a VLOOKUP, we can choose what to include in the final data set. Had I used INNER JOIN, any cat_info without a cat_id in cat_cuteness would not have shown up in the data set. A RIGHT join would be the reverse of this; it would be like doing the VLOOKUP on the cat_cuteness tab.
HLOOKUP
To be honest I haven’t used Excel in so long I couldn’t find a reason why you would need an HLOOKUP, except that someone who hates you gave you data. That’s okay. It happens.

Let’s say someone gave you that. In order to get the info on to the main tab, we have to do:
=hlookup(A2,’Cuteness, but someone who hates us gave it to us’!$A$1:$D$3,2,false)

There is no real hlookup in databases, because in databases, all tables and entities have to be formatted in the same way (with the column names being the first row).
If you can think of an example of HLOOKUP that you would be able to do in a database, let me know.
Logical Functions and Formulas
Learning Outcomes
- Use logical functions and formulas
Excel logic functions evaluate whether the statement and data are considered true or false according to how the formula is established. We will cover the top used Excel logical functions in this section. Exactly like the financial functions, you can use the Formulas tab as before to insert the function or you can begin by typing an equal sign in a cell.
Nested IF
Taking an IF function and adding more than one logic test inside the IF function. In other words, start with and IF function and add another IF function inside the original IF function. A Nested IF formula looks like this: =IF(logical_test,[value_if_true],[value_if_false],IF(logical_test,[value_if_true],[value_if_false]))
Previously, you learned how to use the IF function as a logical way to test your data. Let’s consider that we want to see two variables run at once to create a logical outcome. We’ll now look at Regional Sales for five salespeople over a year and see if they meet the requirements for an annual commission and what the commission amount would be.
Note
Two shortcuts for locking down a cell to make it absolute instead of relative. You’ll need to know these to work faster through creating formulas.
- Short cut key F4 automatically adds in $ to a cell location information to lock it down and make it absolute to stop it from changing as it is dragged into other cells or ranges (e.g. $D$3).
- Name Range locks down a cell like $. To create a Name Range, highlight the cell, click on the Formulas tab, Define Name button. After the dialog box opens, name the location (no spaces) to make it unique for navigation or for formulas (e.g. Commission_Rate).
With a sales spreadsheet open, look at the tiered commission structure. There are two possibilities to earn commission. A Nested IF function is a perfect formula to calculate which salesperson receives how much commission.
Follow these steps to create a Nested IF function:
- First let’s define names for the two types of commissions as this will make it easier to distinguish in the formula. Select the 20% cell and click on the Formula tab, Define Name button and name it Commission_Rate_20. Follow the same steps to name the 10% commission cell too.
- Select the first cell under the commission heading and begin typing =IF, then hit the Tab key and a bracket will automatically appear displaying the logic formula for an IF function.
- Select the total sales for Henry (G4) and type in whether the total sales is less than or equal to the sales goal (G4>=I2). Be sure to hit the F4 button and the absolute $ will fill I2 so it will not change in any way if the formula is moved ($I$2), then type a comma.
- The next portion of the formula is for the percentage of commission to be paid if the goal has been reached. After the comma, choose the sales total again (F4) multiplied by the commission percentage by selecting the name you created earlier (F4*Commession_Rate_20), then a comma to separate the next part of the formula.
- The next portion of the formula is for the other percentage of commission to be paid if that lower goal has been met. After the comma, type another IF and hit the Tab key. Create the same formula again but use the Commission_Rate_10 this time and at the end a comma.
- The last portion of the formula is asking what to do if the logic comes back as a false answer. In this case, it will return a 0. Now finish with two end parentheses, one turns red indicating the second IF function, and then an additional end parenthesis to enclose the entire function begun with the first IF function.
- Now copy the formula down the column all the way to the last salesperson’s total. These are the commissions paid based on the Nested IF formulas created. A nice logical function to make a more difficult task easier in a spreadsheet.

AND
This function returns TRUE if all the arguments in its formula are TRUE and returns FALSE if any of the conditions are false. The Excel formula for this is =AND(logical1,[logical2],…).
In this example the AND function returns TRUE if the first score is greater than or equal to 60 and the second score is greater than or equal to 90, otherwise it returns FALSE.

OR
The OR function returns TRUE if any of the arguments are TRUE. The Excel formula for this is =OR(logical1,[logical2],…).
In this example the AND function returns TRUE if the first score is greater than or equal to 65 and the second score is greater than or equal to 100, otherwise it returns FALSE.

IFERROR
If your formula errors out, you can add in a value you specify how you would like it displayed. This logic function is beneficial to use if you occasionally run into errors and wish to have a cleaner looking spreadsheet to present.
The Excel formula for this is =IFERROR(value,value_if_error).
In this example there is an error in the Profit Margin row that is displayed at #DIV/0!. By using the IFERROR function an error like this can be displayed as a zero or even as text like “Ouch” if so desired. Here is what it looks like:

