
Michelle Pearse wrote an introduction to finding data for Research Assistants. The guide can be found here.

After the research question and hypothesis are posed, the next step is to collect data so that the hypothesis can be tested. The hypothesis posits a relationship between two or more variables. That is, an outcome (dependent) variable is influenced by one or more explanatory (independent) variables.

Example: “Homicide conviction rates in Massachusetts (dependent variable) will be lower in the post-Miranda v. Arizona period (independent variable).”

Once the hypothesis is formulated, the next task is to operationalize the concepts within it. How can these concepts be observed and measured? The conviction rate in Massachusetts could be measured as the ratio of guilty rulings to total rulings for all homicide cases in Massachusetts over the period 1950-2011. The post-Miranda period includes all years following the Supreme Court decision in 1966. We want to make sure that the variables we use to measure the concepts in our hypothesis are both:

  • Valid – the measure reflects the underlying concept accurately.
    • Does the scale accurately report my weight?
  • Reliable – the measure will produce a similar value when the measuring instrument is reapplied. A measure that is reliable need not be valid; indeed, it may consistently produce similar but nevertheless biased estimates.
    • When stepping on the scale multiple times, does it return a consistent weight estimate?
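
To make the operationalization described above concrete, here is a minimal Python sketch. The yearly counts are invented purely for illustration, and the names `rulings`, `conviction_rate`, and `operationalize` are hypothetical. It also assumes that "post-Miranda" means years strictly after 1966; how to treat 1966 itself is a choice the researcher would need to make.

```python
# Hypothetical yearly counts of homicide rulings: (year, guilty, total).
# These figures are invented for illustration; they are not real data.
rulings = [
    (1964, 45, 60),
    (1965, 40, 50),
    (1967, 33, 55),
    (1968, 30, 60),
]

MIRANDA_YEAR = 1966  # year of the Supreme Court decision


def conviction_rate(guilty, total):
    """Dependent variable: ratio of guilty rulings to total rulings."""
    if total <= 0:
        raise ValueError("total rulings must be positive")
    return guilty / total


def operationalize(rows):
    """Turn raw counts into (year, rate, post_miranda) observations.

    post_miranda is 1 for years after the decision and 0 otherwise
    (assumption: the decision year itself is not counted as 'post').
    """
    return [
        (year, conviction_rate(guilty, total), int(year > MIRANDA_YEAR))
        for year, guilty, total in rows
    ]


observations = operationalize(rulings)
for year, rate, post_miranda in observations:
    print(year, round(rate, 2), post_miranda)
```

Each row now holds one measured value of the dependent variable and one of the independent variable, which is the form a statistical test of the hypothesis would need.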

Data are typically classified into two categories: qualitative and quantitative. Each category contains two levels of measurement, as follows:

  • Nominal data are one form of qualitative data where objects have no natural order (e.g. gender, race, religion, brand name). It does not make sense to think of Buddhism being “more than” Confucianism.
  • Ordinal data are another form of qualitative data—specifically, groups which can be ranked. An example of an ordinal variable is a survey respondent’s sense of agreement (e.g. strongly agree, agree, disagree, and strongly disagree). These responses do have a natural order and can be ranked, although the distance between each response is difficult to determine.
  • Interval data are one form of quantitative data which have a definite natural order and, unlike ordinal data, the difference between values can be determined and is meaningful. Interval-level data do not have a natural zero point, however. For example, 0 degrees on the Fahrenheit scale is arbitrary and therefore 100 degrees Fahrenheit is not twice as warm as 50 degrees.
  • Ratio data are the second form of quantitative data. In contrast to interval data, ratio-level data have a non-arbitrary 0 point. For example, 0 yards means no length. 100 yards is twice as far as 50 yards.
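
The interval versus ratio distinction above can be checked with a little arithmetic. The sketch below is illustrative only: it converts the Fahrenheit readings to Kelvin (a temperature scale with a non-arbitrary zero) and the yard measurements to meters, then compares the ratios. The function name `fahrenheit_to_kelvin` is introduced here for the example.

```python
def fahrenheit_to_kelvin(deg_f):
    """Convert degrees Fahrenheit to Kelvin, a scale with a true zero."""
    return (deg_f - 32) * 5 / 9 + 273.15


YARDS_TO_METERS = 0.9144  # exact definition of the international yard

# Interval data: the ratio of the raw Fahrenheit numbers is 2, but that
# ratio does not survive a change to a scale with a true zero.
fahrenheit_ratio = 100 / 50
kelvin_ratio = fahrenheit_to_kelvin(100) / fahrenheit_to_kelvin(50)

# Ratio data: the ratio of two lengths is the same in any unit.
yards_ratio = 100 / 50
meters_ratio = (100 * YARDS_TO_METERS) / (50 * YARDS_TO_METERS)

print("Fahrenheit ratio:", fahrenheit_ratio)
print("Kelvin ratio:", round(kelvin_ratio, 3))
print("Yards ratio:", yards_ratio)
print("Meters ratio:", round(meters_ratio, 3))
```

On the Kelvin scale the ratio comes out at roughly 1.10 rather than 2, while the ratio of the two lengths stays at 2 in either unit.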

Even though qualitative data are mostly based on unordered groups, they can nevertheless be analyzed quantitatively. This is achieved by coding the qualitative data of interest into numerical values. For example, if we are running a survey, we can transform gender (nominal data) into a dichotomous (dummy) variable with each respondent assigned a 1 if female and 0 if male. Likewise, the attitudinal responses on the survey can be assigned numerical values as well, for example, Strongly Agree = 4; Agree = 3; Disagree = 2; Strongly Disagree = 1. Once qualitative data have been coded into numerical variables, they can be analyzed using both basic and advanced statistical models.
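
The coding scheme described above can be sketched in a few lines of Python. The respondents below are made up for illustration, and the names `AGREEMENT_CODES`, `GENDER_CODES`, and `code_response` are hypothetical; only the numeric codes themselves come from the text.

```python
from collections import Counter

# Coding scheme from the text: ordinal agreement scale and gender dummy.
AGREEMENT_CODES = {
    "Strongly Agree": 4,
    "Agree": 3,
    "Disagree": 2,
    "Strongly Disagree": 1,
}
GENDER_CODES = {"female": 1, "male": 0}

# Hypothetical survey responses: (gender, agreement).
respondents = [
    ("female", "Agree"),
    ("male", "Strongly Agree"),
    ("female", "Disagree"),
    ("female", "Strongly Disagree"),
]


def code_response(gender, agreement):
    """Map one raw response to its numeric codes.

    A KeyError is raised for any category missing from the scheme,
    so unexpected values are not silently miscoded.
    """
    return GENDER_CODES[gender], AGREEMENT_CODES[agreement]


coded = [code_response(gender, agreement) for gender, agreement in respondents]

# The mean of a 0/1 dummy variable is the share of respondents coded 1.
share_female = sum(gender for gender, _ in coded) / len(coded)

# Frequency of each agreement code, a basic summary of ordinal data.
agreement_counts = Counter(agreement for _, agreement in coded)

print(coded)
print(share_female)
print(sorted(agreement_counts.items()))
```

Keeping the scheme in explicit dictionaries documents the coding decisions and makes it easy to hand the same scheme to several coders.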

In empirical legal research, content coding of natural language text is commonly employed (see Hall and Wright, 2008; Evans et al., 2007). Content analysis is a popular methodology which, for example, can be employed to summarize characteristics of interest related to court decisions. When possible, it is best to have individuals other than the researcher code the variables so as to reduce bias. A great discussion on intercoder reliability and other issues to consider when having others code for you can be found here.
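
One common way to quantify intercoder reliability is Cohen's kappa, which compares the agreement observed between two coders with the agreement expected by chance. The text above does not prescribe a particular statistic, so the following is a minimal sketch of one option; the coded outcomes ("affirm" and "reverse") and the function name `cohens_kappa` are hypothetical.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labeling the same items.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)
    # Share of items on which the two coders chose the same label.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement from each coder's marginal label frequencies.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes assigned by two coders to the same ten court decisions.
coder_a = ["affirm", "affirm", "reverse", "reverse", "affirm",
           "reverse", "affirm", "affirm", "reverse", "affirm"]
coder_b = ["affirm", "reverse", "affirm", "reverse", "affirm",
           "reverse", "affirm", "affirm", "reverse", "affirm"]

kappa = cohens_kappa(coder_a, coder_b)
print(round(kappa, 2))
```

Here the coders agree on 8 of 10 decisions, but because some of that agreement would occur by chance, kappa is about 0.58 rather than 0.80.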

Empirical research on legal issues can rely on primary (original) as well as secondary (obtained from elsewhere) data. Bradley Wright and Robert Christensen, for example, in studying the effects of public service motivation on job sector choice, employ an original survey of law students in one study (Christensen and Wright, 2011) as well as survey data from the American Bar Association in another study (Wright and Christensen, 2010).

Another data collection technique is web scraping: using software to visit web sites and extract specific bits of information. Here is a tutorial on web scraping written in the R language that was prepared by Jonathan Whittinghill, the Applied Research Statistician at the HLS Empirical Research Services.
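
The tutorial mentioned above uses R. As a rough illustration of the same idea in Python, the sketch below extracts table cells from a small HTML snippet using only the standard library. The page content, the class name `CellExtractor`, and the case names are all invented for this example; a real scraper would first download the page (for instance with `urllib.request.urlopen`) and should respect the site's terms of use.

```python
from html.parser import HTMLParser

# Hypothetical page content standing in for a downloaded web page.
SAMPLE_PAGE = """
<html><body>
<table id="cases">
  <tr><td class="case">Case A</td><td class="outcome">guilty</td></tr>
  <tr><td class="case">Case B</td><td class="outcome">not guilty</td></tr>
</table>
</body></html>
"""


class CellExtractor(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # completed rows, each a list of cell strings
        self._row = None       # cells of the row currently being parsed
        self._in_cell = False  # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append([cell.strip() for cell in self._row])
            self._row = None


parser = CellExtractor()
parser.feed(SAMPLE_PAGE)
parser.close()

for case_name, outcome in parser.rows:
    print(case_name, "->", outcome)
```

The extracted rows can then be coded into numerical variables in the same way as survey responses.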