Bee Colony Loss Data

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   1.0.0 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(tidymodels)

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.0.1
✔ infer        1.0.4     ✔ workflows    1.1.2
✔ modeldata    1.1.0     ✔ workflowsets 1.0.0
✔ parsnip      1.0.3     ✔ yardstick    1.1.0
✔ recipes      1.0.4     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Use suppressPackageStartupMessages() to eliminate package startup messages

library(skimr)


colony <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-11/colony.csv')

Rows: 1222 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): months, state
dbl (8): year, colony_n, colony_max, colony_lost, colony_lost_pct, colony_ad...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

colony %>%
  head()

# A tibble: 6 × 10
   year months     state colon…¹ colon…² colon…³ colon…⁴ colon…⁵ colon…⁶ colon…⁷
  <dbl> <chr>      <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1  2015 January-M… Alab…    7000    7000    1800      26    2800     250       4
2  2015 January-M… Ariz…   35000   35000    4600      13    3400    2100       6
3  2015 January-M… Arka…   13000   14000    1500      11    1200      90       1
4  2015 January-M… Cali… 1440000 1690000  255000      15  250000  124000       7
5  2015 January-M… Colo…    3500   12500    1500      12     200     140       1
6  2015 January-M… Conn…    3900    3900     870      22     290      NA      NA
# … with abbreviated variable names ¹colony_n, ²colony_max, ³colony_lost,
#   ⁴colony_lost_pct, ⁵colony_added, ⁶colony_reno, ⁷colony_reno_pct

colony_splits <- initial_split(colony, prop = 0.5)

training_data <- training(colony_splits)
test_data <- testing(colony_splits)

Bee Colony Loss Analysis

Abstract

This is an analysis of the Bee Colony data set. The bee colony data set provides information on honey bee colonies across the United States with regards to the number of colonies, maximum, lost, percent lost, added, renovated, and percent renovated. In this technical report I aim to answer the following questions: during which months did we lose the most bee colonies? During which years did we lose the most bee colonies? Is there a state that seems superior to the others in terms of honey bee colony size? Is there a state that seems to lose more honey bee colonies than others?

This data set spans several years during which the number of bee colony losses per state were observed and recorded. Using RStudio we were able to create box plots, allowing for better visualization and analysis of the data when focusing on correlation between months/year or state and bee colony losses observed. Using RStudio I was able to deduce from the graphs and tables created that the most bee colonies were lost during October-December.

Interesting Questions

During which months did we lose the most bee colonies? The least?
During which years did we lose the most bee colonies? The least?
Is there a state that seems superior to the others in terms of colony size?
Is there a state that seems to lose more bee colonies than others?

Hypotheses

I hypothesize that the most bee colonies were lost throughout October-December and the least bee colonies were lost throughout April-June.
I hypothesize that the most bee colonies were lost during 2015 and the least bee colonies were lost during 2017.
I hypothesize that California is the largest in size and has the most bee colonies.
I hypothesize that Florida lost the most bee colonies during the years this data was collected.

training_data %>%
  skim()

Data summary
Name	Piped data
Number of rows	611
Number of columns	10
_______________________
Column type frequency:
character	2
numeric	8
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
months	0	1	10	16	0	4	0
state	0	1	4	14	0	47	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	0	1.00	2017.81	1.81	2015	2016.00	2018	2019	2021	▇▅▅▅▆
colony_n	27	0.96	110319.64	400326.11	1300	8000.00	17000	53000	3181180	▇▁▁▁▁
colony_max	37	0.94	75456.31	168999.94	1700	9000.00	21000	65000	1400000	▇▁▁▁▁
colony_lost	27	0.96	14750.45	54305.79	20	930.00	2100	6500	500020	▇▁▁▁▁
colony_lost_pct	31	0.95	11.35	7.15	1	6.75	10	15	52	▇▅▁▁▁
colony_added	45	0.93	15586.61	62574.11	10	402.50	1900	6000	736920	▇▁▁▁▁
colony_reno	70	0.89	13940.24	56907.24	10	270.00	930	4570	692850	▇▁▁▁▁
colony_reno_pct	130	0.79	9.33	10.57	1	3.00	6	13	77	▇▁▁▁▁

Introduction

This report provides information on honey bee colonies in the US regarding the number of colonies, maximum, lost, percent lost, added, renovated, and percent renovated. Colony loss rates are calculated as the ratio of the number of colonies lost to the number of colonies managed over a defined period. Colony loss rates are best interpreted as a turn-over rate, as high levels of losses do not necessarily result in a decrease in the total number of colonies managed in the United States. I hypothesized that the most bee colonies were lost October-December and the least number of bee colonies were lost April-June. It was also hypothesized that the most bee colonies were lost during 2015 and the least bee colonies were lost during 2017, when focusing on year.

Exploratory Data

I decided to only use data pertaining to the state of Florida, so I could get a general idea of the average values for bee colonies in this state. I chose Florida because while skimming the data it looked like it was higher (numerically) when comparing it to the other states. Once I was able to isolate the data from the state of Florida, it became obvious that the average percent of colonies lost (14%) was greater than the original average percent of colonies lost (11%) when creating our exploratory data.

training_data %>%
  filter(state == "Florida") %>%
  skim()

Data summary
Name	Piped data
Number of rows	16
Number of columns	10
_______________________
Column type frequency:
character	2
numeric	8
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
months	0	1	10	16	0	4	0
state	0	1	7	7	0	1	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	0	1.00	2017.69	1.78	2015	2016.75	2018	2019.0	2020	▇▆▆▆▆
colony_n	1	0.94	238200.00	41658.82	176000	210000.00	240000	265000.0	305000	▆▇▆▃▆
colony_max	1	0.94	258666.67	39300.43	180000	235000.00	260000	285000.0	315000	▂▆▇▆▇
colony_lost	1	0.94	36333.33	8574.60	20000	30000.00	35000	43500.0	50000	▂▇▅▃▆
colony_lost_pct	1	0.94	14.33	3.90	8	12.00	14	16.5	22	▅▅▇▃▃
colony_added	1	0.94	47733.33	12144.76	30000	40000.00	46000	54000.0	80000	▆▇▇▁▂
colony_reno	1	0.94	30633.33	14386.83	8000	21250.00	31000	37000.0	69000	▅▅▇▁▁
colony_reno_pct	1	0.94	12.40	6.30	3	7.50	12	15.0	25	▆▅▇▂▃

This table allows me to look at data specific to the state of Florida. This table tells me that Florida has a mean value of 14% for percent of average colony lost. This is a slight increase from the first table created at the beginning of the report, which gave a value of 11% for the average percent of honey bee colonies lost. Seeing how both values are precise makes me more confident of this data.

Next, I decided to look during which months the greatest number of honey bee colonies were lost to see if there was a possible correlation between the time of year (months) and amount of colony lost. I originally used a bar graph to display this data, but quickly discovered a box plot would do a much better job of displaying the data, which I then created immediately following this graph.

training_data %>%
  ggplot() +
  geom_bar(aes(x=months, y= colony_lost),
           stat="identity")

Warning: Removed 27 rows containing missing values (`position_stack()`).

This graph made me realize that a bar graph was not the best format to use when displaying my data. Although the data could be organized in a better way, this graph told me that the most (top 4) bee colonies were lost: April-June, January-March, July-September, and October-December. If we were actually to look closer, however, we see that the most bee colonies were lost October-December.

I decided to create a box plot with my months and colony lost data once again to create a graph that would hopefully allow me to understand the results better. Looking at the graph we see that that data is organized in a much more visually pleasing manner that allows for better analysis. Similar to the bar graph, this graph shows the top four time periods where the greatest number of bee colonies were lost.

training_data %>%
  ggplot() +
  geom_boxplot(aes(x=months, y= colony_lost)) +
  scale_y_log10()

Warning: Removed 27 rows containing non-finite values (`stat_boxplot()`).

This graph, especially when the scale_y_log10() was added to the code, displayed the data in a more visually appealing way. Not only were we able to look at the box-plot and see that the most bee colonies were lost October-December, but we also learned that the amount of bee colonies lost were very numerically close between the different months of time. This makes me a little wary of the data because it makes me question how accurate the data is, I would have assumed there to be a more drastic difference between the months (more specifically Winter and Spring/Fall/Summer).

Next I went ahead and created another box plot, but this time focusing on the year and colony lost. I wanted to see if there was a correlation between the year and colony lost. I also wanted to see if it was there were any similarities between months and colony lost and years and colony lost. The first graph was not successful in that it only plotted the data for the year 2018 so I was really unable to gather much information from this box-plot.

training_data %>%
  ggplot() +
  geom_boxplot(aes(x=colony_lost, y= year))

Warning: Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?

Warning: Removed 27 rows containing missing values (`stat_boxplot()`).

This box-plot was my first attempt at looking at the (possible) correlation between the amount of honey bee colonies lost and the year. I was curious to see if there was a year where honey bees were more greatly affected. Unfortunately, this graph is only recognizing the year 2018.

After going back and looking over my data I decided to try and make a plot graph to better display my data, but this time I was going to focus on colony lost and colony added. I decided to focus on a different aspect within my data because I had not been so successful focusing on time (months/years) and colony lost. Below you can see that the graph actually ended up coming out really nicely and was a lot easier visually to digest.

training_data %>%
  ggplot() +
  geom_point(aes(x=colony_lost, y=colony_added)) +
  geom_smooth(aes(x=colony_lost, y=colony_added),
              method="lm", se= FALSE, color="red")

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 45 rows containing non-finite values (`stat_smooth()`).

Warning: Removed 45 rows containing missing values (`geom_point()`).

This graph shows the correlation between the amount of honey bee colonies lost and amount of honey bees colonies added. A line of best fit was created to better analyze and understand the data. Looking at the line of best fit we see that the lower numbers appear more accurate compared to the higher data sets where there are far less data points and more outliers present.

Conclusion

Using RStudio I was able to answer my original questions regarding the Honey Bee Colony Data Set. For starters, it was concluded that the most bee colonies were lost during October-December. However, RStudio gave us the top four periods of time (months) that lost the most bee colonies and all were suspiciously close numerically. More data should be acquired on the possible (human) error that occurred while collecting this data set. Second, by creating a plot graph, we were able to deduce that there is a correlation between the amount of honey bee colonies added and the amount of colony lost. Once the line of best fit was added, it was clear that more graphs would help us better understand what we were seeing (a grouping of low numerical values and then some higher numerical values that strayed much further from the line of best fit). This report was an initial attempt to understand how RStudio works and its ability to organize biological data. Moving forward, more data pertaining to our primary questions should be gathered and more detailed graphs created.