Friday, October 17, 2014

Summation Notation: Sigma IS Sum

The time has come to discuss actual mathematical operations, specifically summation notation. This post features the capital version of your soon-to-be favorite Greek letter, sigma (Σ).

Summation notation is also known as sigma notation, or weird squiggly line from hell, depending on your preference. In statistics, summation notation is unavoidable. Getting comfortable with summation notation, and reviewing the math concepts that go along with it, will allow you to understand and use the formulas involved in fundamental statistical concepts.

Summation notation is helpful (and inevitable) when you are working with a sequence of numbers. A sequence of numbers is a list of numbers that is in order. When you are working with a dataset, you are working with sequences of numbers. A sequence of n numbers can be denoted as {x_1, x_2, x_3, ..., x_n}.

So, consider the sequence: {1, 4, 2, 7, 5}. For this sequence the following will be true:

n = 5
x_1 = 1
x_2 = 4
x_3 = 2
x_4 = 7
x_5 = 5

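In summation notation, an expression (also referred to as a function or formula) is evaluated for given values in a sequence. The results of that expression are added together, hence the term summation notation. The following breaks down the basic features in summation notation:

$$\sum_{i=1}^{n} x_i$$

Here i is the index, the 1 below the sigma is the index of the first observation to include, the n above the sigma is the index of the last observation to include, and x_i is the expression that gets evaluated at each index.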


Using our sequence from before, {1, 4, 2, 7, 5}, the above summation notation turns into:

$$\sum_{i=1}^{5} x_i = 1 + 4 + 2 + 7 + 5 = 19$$


Given that i = 1, we start with the first observation (which equals 1). The expression to be evaluated is simply the value at the given index. We then move on and add the value of the expression for observation 2 (which equals 4), and so forth up to observation number n, which is the last observation in our sequence.
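If it helps to see the walk through the index as code, here is a minimal sketch. Python is just a convenient choice here, and the variable names (`xs`, `total`) are made up for illustration. The one wrinkle is that Python counts positions from 0, so x_1 lives at `xs[0]`.

```python
# The example sequence from this post. Python lists are 0-indexed,
# so x_1 in the notation corresponds to xs[0] in the code.
xs = [1, 4, 2, 7, 5]
n = len(xs)

# Walk the index i from 1 up to n and add x_i to a running total.
total = 0
for i in range(1, n + 1):
    total += xs[i - 1]

print(total)             # 19
print(total == sum(xs))  # True: the built-in sum() does the same job
```

In practice you would just call `sum(xs)`. The loop is only there to show what the sigma is asking you to do.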

It may have been a while since you took a math course. You may have hoped you would never have to use math again. That's ok, you're ok, it will all be ok. Before diving into summation notation it is important to take it back to basics: Order of operations. You may remember a little something like this from your math days:

PEMDAS
Parentheses
Exponents
Multiplication
Division
Addition
Subtraction

Summation notation is addition in wolf's clothing. What does this mean? It means that despite its intimidating display summation notation is a basic operation. Why is this important? It is important because order of operations can make seemingly similar expressions mean completely different things. Order of operations is like punctuation for math. I would argue PEMDAS is easier but that is because I have some serious anxiety stemming from commas and semicolons.

Example time!

We will stick with our previous sequence: {1, 4, 2, 7, 5}.

Consider the following:

$$\sum_{i=1}^{n} x_i + 2$$

Your initial inclination may be to expand this equation in the following way:

$$(1 + 2) + (4 + 2) + (2 + 2) + (7 + 2) + (5 + 2)$$


Your initial inclination is not your friend. The above expansion would be appropriate for the following equation:

$$\sum_{i=1}^{n} (x_i + 2)$$


The important difference here is the parentheses. With parentheses, the "+2" is included within the summation. The order of operations dictates that anything within parentheses is computed before any addition operations. This means that 2 is added to every value used in the summation. When there are no parentheses, the proper expansion is:

$$\sum_{i=1}^{n} x_i + 2 = (1 + 4 + 2 + 7 + 5) + 2 = 21$$


In this case, 2 is added after the summation is carried out.
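To see how much the parentheses matter, here is a small sketch of both readings using our sequence. The variable names are made up for illustration.

```python
xs = [1, 4, 2, 7, 5]

# No parentheses: sum the values first, then add 2 once at the end.
sum_then_add = sum(xs) + 2

# With parentheses: add 2 to every value, then sum the results.
add_then_sum = sum(x + 2 for x in xs)

print(sum_then_add)  # 21
print(add_then_sum)  # 29

# The second version adds 2 a total of n times instead of once,
# so the two results differ by 2 * (n - 1).
print(add_then_sum - sum_then_add)  # 8
```

Same numbers, same "+2", two different answers. That is the whole argument for paying attention to order of operations.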

When dealing with summations, it is helpful to remember that basic algebra may save you time and help you avoid computation errors. Examine the following two formulas:




Notice anything? The formulas certainly look similar. In fact, they provide us with the same results. Let's look at the expansions:



By factoring out the common denominator, we can simplify this to:


In summation form this would be:
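
As a rough sketch of this kind of simplification, assume every value in the sum is divided by the same constant. The constant 2 below is purely an illustrative assumption and the variable names are made up. Dividing each value and then summing gives the same result as summing first and dividing once.

```python
from fractions import Fraction

xs = [1, 4, 2, 7, 5]
c = 2  # illustrative constant, an assumption made for this sketch

# Divide every value by c, then sum the pieces.
divide_each = sum(Fraction(x, c) for x in xs)

# Sum the values first, then divide by c once.
divide_once = Fraction(sum(xs), c)

print(divide_each, divide_once)    # 19/2 19/2
print(divide_each == divide_once)  # True
```

`Fraction` is used only so the comparison is exact. With ordinary decimals the two versions would still agree here, but tiny rounding differences can creep in with messier numbers.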


Oftentimes, we are not just working with a single sequence. When we have data on multiple variables, we are working with multiple sequences. Let's say we now have two variables. The data can be represented by two sequences:

X = {1, 4, 2, 7, 5}
Y = {2, 5, 3, 6, 1}

The same rules apply to summation notation with two variables. Consider this:

$$\sum_{i=1}^{n} x_i y_i$$


This results in the following expansion:

$$(1)(2) + (4)(5) + (2)(3) + (7)(6) + (5)(1) = 2 + 20 + 6 + 42 + 5 = 75$$


The values from sequence X and sequence Y that share the same index were multiplied together. These products were then summed. When working with multiple sequences it may be helpful to arrange values in a table. A table for sequences X and Y might look like this:

i     x_i   y_i   x_i * y_i
1     1     2     2
2     4     5     20
3     2     3     6
4     7     6     42
5     5     1     5
Sum   19    17    75
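The "match by index, multiply, then add" recipe can be sketched in a few lines. The variable names are made up for illustration.

```python
X = [1, 4, 2, 7, 5]
Y = [2, 5, 3, 6, 1]

# zip() pairs up the values that share an index: (1, 2), (4, 5), ...
products = [x * y for x, y in zip(X, Y)]
sum_of_products = sum(products)

print(products)         # [2, 20, 6, 42, 5]
print(sum_of_products)  # 75

# Careful: this is not the same as multiplying the two sums together.
print(sum(X) * sum(Y))  # 323
```

The last line is there as a warning. The sum of the products and the product of the sums are different calculations, and they show up in different formulas.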



That covers the basics of summation notation. If you have grown to love our dear friend sigma, never fear, this is only the beginning of what's sure to be a beautiful (unavoidable) friendship.



Sunday, October 12, 2014

NOIR: The Glamorous World of Measurement

Many intro stats courses start out with an overview of levels of measurement, sometimes called types of measurement scales. There will be some PowerPoint slides, all of a sudden the meaning of 0 will become a confusing matter of grave importance, you might have a quiz, and then the whole sordid affair will fade into the background, occasionally being dredged up when talking about assumptions.

But wait a minute, this stuff is like, sort of important. There is a reason this is the topic many stats courses cover first. Math in the K-12 arena tends to use numbers. You might be thinking, yeah, duh, I mean, it is math. But the thing is, when you are moving into the realm of stats and measurement, you are shifting from viewing a number as just a number to something a bit more fuzzy. Welcome to the real world, things can get a little messy.

So let's begin with levels of measurement. In science (yes, even social science) data (information) is collected. This data is obtained by using some sort of scale. That data can be analyzed and might even result in meaningful conclusions, given that you understand what you have collected and how to analyze your unique special snowflake data. The type of scale will help you determine which analyses will provide you with meaningful results.

So let's discuss levels of measurement and the implications.

Nominal

Nominal scales have two or more categories. The data produced by these scales are unique in that different values indicate different classifications, but those differences do not have an implied order.

It does not make sense to calculate the average value of a variable that is scored on a nominal scale. However, you can perform calculations that are based on frequencies (counting how many subjects responded in a given way).

Consider the following question:

Which flavor of ice cream is your favorite?
  1. Vanilla
  2. Chocolate
  3. Strawberry
  4. Rocky Road
The dataset (collection of individual responses) may contain the numbers 1-4 to indicate the response to this question. However, those numbers don't really mean much. Someone who answered 4 did not score higher than someone who marked 1. Those two people just answered differently (even if Vanilla is lame and Rocky Road will always reign supreme).
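
Here is a rough sketch of what "calculations based on frequencies" can look like for the ice cream question. The ten responses are made up for illustration, and so are the variable names.

```python
from collections import Counter

# Hypothetical coded responses from 10 respondents (made up for illustration).
responses = [1, 4, 2, 4, 3, 1, 4, 2, 4, 1]
labels = {1: "Vanilla", 2: "Chocolate", 3: "Strawberry", 4: "Rocky Road"}

# Count how many respondents picked each flavor.
counts = Counter(labels[r] for r in responses)
mode_flavor, mode_count = counts.most_common(1)[0]

print(counts["Vanilla"], counts["Chocolate"], counts["Strawberry"], counts["Rocky Road"])  # 3 2 1 4
print(mode_flavor, mode_count)  # Rocky Road 4

# The mean of the codes can be computed, but it does not describe any flavor.
mean_code = sum(responses) / len(responses)
print(mean_code)  # 2.6
```

The counts and the most popular flavor are meaningful. The 2.6 is not: relabel Rocky Road as 1 and Vanilla as 4 and the "average flavor" changes even though nobody changed their answer.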

Other commonly used examples of variables measured with nominal scales are race, political affiliation, gender, major, and religion.

Ordinal

The values produced by ordinal scales have an implied order but there is not necessarily an equal distance between values. Ordinal scales often occur when people or things are being ranked.

One of the most common examples of an ordinal scale is how a person places in a race (1st, 2nd, 3rd, etc.). A person who comes in first in a marathon completed the marathon faster than the person who came in second. However, we cannot say that the time difference between the 1st place finisher and the 2nd place finisher is equal to the time difference between the 2nd place finisher and the 3rd place finisher.

Would it make sense to take an average of an ordinal variable? Let's stick with the example of a marathon runner. If a person wanted to track his or her performance in various marathons, would he or she want to look at his or her average place (which could have a value of 5.5) or would he or she want to look at his or her average time? I would argue that an average place does not have a clear meaning. If I were in the habit of running marathons (full disclosure: I'm most certainly not) I would be concerned with my average time and how my new times compare to my old, rather than how I was ranking compared to a changing group of competitors.

Other common examples of Ordinal scales are: class rank and items that use a Likert scale (strongly disagree... strongly agree).

Interval

Interval scales, like ordinal scales, contain values with a meaningful order. Interval scales also have equal distances between the values. Interval scales, however, do not have an absolute zero point.

A common example of an interval scale is temperature when measured in degrees Fahrenheit or degrees Celsius. The difference between 0° Fahrenheit and 10° Fahrenheit is the same as the difference between -10° Fahrenheit and 0° Fahrenheit.

Calculating the average temperature for a given month would give us a meaningful result. However, we cannot say that 40° Fahrenheit is twice as hot as 20° Fahrenheit. That is because a value of 0° Fahrenheit is not an absolute 0 point. There can be negative values. Think about this:

If I claimed that 20° was 2 times as cold as 40° then, applying the same reasoning, -20° would be -2 times as cold as 40°. That does not make much sense.
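
One way to check the "twice as hot" claim is to move both temperatures onto a scale that does have an absolute zero. The sketch below uses the Kelvin scale, which is not part of the example above and is brought in only for illustration. The function name is made up.

```python
def fahrenheit_to_kelvin(f):
    """Convert degrees Fahrenheit to kelvins, a scale with a true zero."""
    return (f - 32) * 5 / 9 + 273.15

warm = fahrenheit_to_kelvin(40)
cold = fahrenheit_to_kelvin(20)

naive_ratio = 40 / 20       # 2.0, but only because of where 0 °F happens to sit
kelvin_ratio = warm / cold  # roughly 1.04

print(round(warm, 2), round(cold, 2))
print(naive_ratio, round(kelvin_ratio, 2))
```

On a scale with a real zero point, 40° Fahrenheit works out to only about 4% "more temperature" than 20° Fahrenheit, nowhere near double.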

Other examples of interval scales are: shoe size and women's pants size (0, 2, 4...).

Ratio

Ratio scales have values with a meaningful order, equal distances between points, and an absolute 0 point. The absolute 0 point indicates that none of the variable being measured is present.

Common examples of variables measured with Ratio scales include height (inches or centimeters), weight (pounds), income (dollars), and age (years).

As the name implies, Ratio scales mean that ratio calculations can be meaningfully applied. Sally can be twice as tall as Billy.

Choosing a Level of Measurement

So at this point you may be feeling pretty comfortable with levels of measurement, that's great! Sometimes things can get a little fuzzy, however. For example, think about two possible questions we could include in a survey:

Question A: Please enter your yearly income: _________

Question B: Please indicate your income level:

  1. Under $25,000
  2. Between $25,000 and $50,000
  3. Between $50,000 and $100,000
  4. Above $100,000
The responses to question A would provide us with data on a Ratio scale. Question A has a 0 point of $0 and the value of a single dollar is a standard metric, giving us equal distance between points. 

The responses to question B, however, would give us data on an Ordinal scale. Income is lowest for response option 1 and highest for response option 4.  Note that the difference in income between the response options is not constant. 
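
To make the loss of information concrete, here is a minimal sketch that turns Question A answers into Question B answers. The incomes are made up, the function name is hypothetical, and the handling of the boundary values is an assumption, since the response options overlap at $50,000 and do not say where exactly $25,000 or $100,000 belongs.

```python
def income_bracket(income):
    """Map a dollar income to Question B's response options 1-4.

    Boundary handling is an assumption: $25,000 goes to option 2,
    $50,000 goes to option 3, and exactly $100,000 stays in option 3.
    """
    if income < 25_000:
        return 1
    if income < 50_000:
        return 2
    if income <= 100_000:
        return 3
    return 4

# Hypothetical Question A responses (made up for illustration).
incomes = [18_000, 26_000, 49_000, 51_000, 250_000]
brackets = [income_bracket(x) for x in incomes]
print(brackets)  # [1, 2, 2, 3, 4]
```

Notice that $26,000 and $49,000 land in the same category even though they are $23,000 apart, while $49,000 and $51,000 land in different categories even though they are only $2,000 apart. The order survives. The distances do not.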

Why would someone choose to use Question B instead of Question A? Well, Question B may result in less user error. When respondents are asked to fill in blanks, $10,000 can easily turn into $100,000 by mistake. Or maybe the researcher just wants to get a general idea of the income spread of survey respondents to ensure that the sample reflects the population of interest, but is not interested in doing any in-depth analysis of income.

When considering what type of data you will collect it is important to determine what questions you want to answer with your data. The more clearly defined your questions are, the easier it will be to design a study and analysis plan. 

Shades of Gray

This next section is generally beyond the level of an introductory course, but it may be worthwhile to read and ponder if you are thinking of pursuing a career in the social sciences, or if you just love learning/ measurement/ procrastinating. 

While levels of measurement may seem clear-cut at this stage, things can get wonky, especially in the behavioral sciences. For example, think about a scale measuring depression. Picture a simple, 10 item scale where a person either marks "Agree" or "Disagree" for each item on the scale. Items may be similar to "I have felt sad in the past week." and "I have considered killing myself in the past week." People could endorse 0 items all the way up to 10 items.

Since it is possible for people to endorse 0 items, you may initially think this is a Ratio scale. But does endorsing 0 items indicate a complete lack of depression? Maybe, but it would be pretty impressive if we covered all possible indicators of depression in only 10 items.

Say we give up on an absolute 0 point. Each point indicates an endorsement of a question so it would be reasonable to think we have equal distance between points. So we have an interval scale, right? This would mean that each question should carry an equal weight of depression. Is feeling sad equal to thinking about suicide? Things are getting tricky. 

Ok, so maybe we don't have a 0 point, and maybe we don't have equal distances. But a score of 6 is definitely greater than a score of 5, and so on. Or is it? If our questions may indicate different amounts (or severity) of depression, then how can we order people based on a simple count of the number of items endorsed? What if one person endorsed 6 seemingly less serious items and another person endorsed 4 seemingly very serious items?
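
Here is a toy sketch of that last scenario. The severity weights are entirely made up for illustration (a real scale would need evidence to justify any weighting), and the function and variable names are hypothetical.

```python
# Hypothetical severity weights for the 10 items (made up for illustration).
weights = [1, 1, 1, 1, 1, 1, 3, 3, 4, 5]

# 1 = "Agree", 0 = "Disagree" for each of the 10 items.
person_a = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # 6 seemingly less serious items
person_b = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]  # 4 seemingly very serious items

def count_score(answers):
    """Simple count of endorsed items."""
    return sum(answers)

def weighted_score(answers):
    """Count of endorsed items, each multiplied by its assumed severity weight."""
    return sum(w * a for w, a in zip(weights, answers))

print(count_score(person_a), count_score(person_b))        # 6 4
print(weighted_score(person_a), weighted_score(person_b))  # 6 15
```

With a simple count, person A scores higher. With these made-up weights, person B does. Which ordering is "right" depends on assumptions about the items that the raw count quietly hides.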

Questions related to measurement can be confusing. In the social sciences we rely heavily on surveys to give us insights into human thoughts and behaviors. It is important to remember that things are not always as straightforward as they seem, or as we wish them to be. However, these gray areas can be intriguing. If we knew everything then what would be left to argue about during department happy hours?

Thursday, October 9, 2014

The Power in Statistics: A Non-Beta Explanation of Why This Blog Exists

On the first day of lab every year I like to go around the room and have students introduce themselves, say some random fact, and tell me how they feel about the course. I teach a lab for introduction to statistics for psychology majors. There tends to be a significant amount of fear in the room. This is the psych department after all. Psychology is a "soft" science, it is all about feeling feelings and the general power of intuition, numbers just don't jibe with the warm and fuzzy feeling we are trying to cultivate. Right? RIGHT?

....Well, hopefully, wrong. Psychology and other social sciences are often scoffed at by the "hard" sciences. You know, the people with lab coats, large budgets, and buildings that need to be evacuated with shocking frequency (seriously guys, enough with the chemical spills). But when it comes to the social sciences the science part comes through with an understanding of statistics, research methods, and measurement.

These three topic areas are interrelated, and at the undergraduate level the typical Intro Stats for Psych Majors (or Business Majors, Sociology Majors... Invented Majors) tends to be a giant party full of Greek letters (sans kegs), validity, and some talk of randomization. These classes vary in their approach, but by the look of my Facebook wall the student freak outs are a unifying feature.

So why blog about it? I am a PhD student in Psychology and I actually like statistics. I do not need everyone to like statistics but I do feel that a basic understanding of stats, measurement, and methods gives people power, no matter what their future holds.

The concepts covered in these intro courses are used everywhere. If you master a basic understanding of stats, measurement, and methods you open the door to understanding the processes underlying decision making and theory generation in a variety of fields. Medical decisions, economic policy, rocket science, it all comes back to the basics. The problem is, these concepts are often seen as something to get through, a gateway requirement, and the resources for undergraduates can be lacking.

This blog is meant to be accessible to the novice. The kid in the lecture hall wearing the questionable sweatpants wondering when math changed from seemingly sensible numbers to wonky ancient alphabets. It is meant to provide overviews of the basics in an easy to access format. I am not an expert in statistics. I do however have some experience with student stats freak outs.

I am writing this because I think when it comes to stats it is helpful to have a variety of tools available. If this blog helps you, fantastic, if you hate it, find something you don't. The topics we will cover are things I love to talk and think about. They are things that, once you master them, I hope will help you feel empowered.

And if you are reading this just so you can get through your intro course with the required B-, that's ok too.