Crop-cut data: India¶

In this chapter we pre-process rice crop-cut data from Odisha, collected by Iftikar et al., 2017:

Wasim Iftikar; Nabakishore Parida; Vivek Kumar; Narayan Chandra Banik; Amit Mishra, 2017, “Odisha Rice Crop Cut Data 2013 - 2016”, hdl:11529/11111, CIMMYT Research Data & Software Repository Network, V2

The data are avaialble on the CIMMYT data repository https://data.cimmyt.org/dataset.xhtml?persistentId=hdl:11529/11111

We are not reproducing a paper here, we are just finding out how to pre-process the data so that they can be used for further analysis.

Get the data¶

We use the agro package to download the data.

ff <- agro::get_data_from_uri("hdl:11529/11111", ".")
ff
## [1] "./hdl_11529_11111/CSISA_OD_RiceCropCut_AllYearRawDataFinal.csv"
## [2] "./hdl_11529_11111/CSISA_OD_RiceCropCut_MetaSheet_1.csv"

We have two .csv files. One with data, and one with metadata.

ff[1]
## [1] "./hdl_11529_11111/CSISA_OD_RiceCropCut_AllYearRawDataFinal.csv"
d <- read.csv(ff[1], stringsAsFactors = FALSE)
dim(d)
## [1] 816  33

There are 816 records and 33 variables, with these names

colnames(d)
##  [1] "Year"                "Season"              "District_Name"
##  [4] "FID"                 "Latitude"            "Longitude"
##  [7] "CEM"                 "VAR"                 "DOS"
## [10] "DOT"                 "Seed_Rate"           "DAP_Total"
## [13] "MoP_Total"           "Urea_Total"          "Urea_1_Top"
## [16] "MoP_1_Top"           "Urea_2_Top"          "MoP_2_Top"
## [19] "Urea_3_Top"          "MoP_3_Top"           "Pre_Herbicide_Name"
## [22] "Pre_Herbicide_Amt"   "Post_Herbicide_Name" "Post_Herbicide_Amt"
## [25] "H_Weed_No."          "DOH"                 "TAGB_A"
## [28] "TAGB_B"              "TAGB_C"              "Yield_A"
## [31] "Yield_B"             "Yield_C"             "Gr_Moisture"

At this point it is good to explore the data a bit using functions such as summary, View, plot and table. For example

table(d$Year)
##
## 2013 2014 2015 2016
##  149  250  201  216

Showing that we have data for four years.

Compute mean yield¶

We are dealing with crop cut data, and our first interest is in the observed yield. There are three columns with “Yield” in the name. So let’s check these out.

yield <- d[, c("Yield_A", "Yield_B", "Yield_C")]
head(yield)
##   Yield_A Yield_B Yield_C
## 1    2800    2800    2700
## 2    2200    2100    2100
## 3    2100    2000    2040
## 4    2400    2400    2100
## 5    1900    1900    2100
## 6    1500    1420    1400

Look’s good, let’s get the mean yield

y_avg <- rowMeans(yield)
## Error in rowMeans(yield): 'x' must be numeric

Apparently, our yield data are not numeric. Let’s check that.

sapply(yield, class)
##     Yield_A     Yield_B     Yield_C
##   "numeric" "character" "character"

We can change the character fields to numeric

y <- apply(yield, 2, as.numeric)
## Warning in apply(yield, 2, as.numeric): NAs introduced by coercion
## Warning in apply(yield, 2, as.numeric): NAs introduced by coercion

But now we get a warning saying that some values cannot be converted to numbers. Never just ignore warnings like this. You have to investigate why they occur before you can continue. Often this means that there is an error in the data.

Let’s find some of the records that could not be converted.

na_rows <- apply(is.na(y), 1, any)
table(na_rows)
## na_rows
## FALSE  TRUE
##   667   149
nas <- which(na_rows)
head(nas)
## [1] 668 669 670 671 672 673
yield[nas[1:5], ]
##     Yield_A Yield_B Yield_C
## 668   12.50       -       -
## 669   18.00       -       -
## 670   12.00       -       -
## 671   14.08       -       -
## 672   13.00       -       -

Ah, some records only have missing data indicated by "-". Let’s see if that is connected to the year of data collection.

table(d$Year, na_rows)
##       na_rows
##        FALSE TRUE
##   2013     0  149
##   2014   250    0
##   2015   201    0
##   2016   216    0

It is indeed connected to the year. All records for 2013 have only one yield measurement, all other years have three samples per field.

Instead of continuing, we should see what we can do to allow converting these data without getting a warning. We should be able to do that by first replacing the "-" with NA.

yield[yield == "-"] <- NA
y <- apply(yield, 2, as.numeric)

All right, with that problem solved, let’s try again to compute the the mean yield for each row. Note that we have to use na.rm=TRUE to avoid getting missing values for 2013.

y_avg <- rowMeans(y, na.rm=TRUE)

To keep all records together, we can add y_avg back to our original data, data.frame d

d$yield <- y_avg

Let’s see what the yields look like in Odisha:

plot(sort(d$yield))

Yikes, that does not look right. Let’s look at the mean yield per year

tapply(d$yield, d$Year, mean, na.rm=TRUE)
##         2013         2014         2015         2016
##    9.5491275    2.6973987    0.5751542 2283.5231481

Or as a boxplot

boxplot(yield~Year, data=d)

These are wildly different numbers. Perhaps different units? Let’s consult the metadata!

Metadata¶

The second file contains metadata (at least that is what the file name suggests).

ff[2]
## [1] "./hdl_11529_11111/CSISA_OD_RiceCropCut_MetaSheet_1.csv"
md <- read.csv(ff[2], stringsAsFactors = FALSE, header=FALSE)
head(md)
##             V1                                               V2
## 1      Project Cereal Systems Initiative for South Asia (CSISA)
## 2        Donor                                     BMGF & USAID
## 3      Country                                            India
## 4          Hub                                           Odisha
## 5 DocumentName                                      RiceCropCut
## 6 ProducedYear                                             2017

This .csv file did not have a “header” (no column/variable names), so we add them for clarity

colnames(md) <- c("var", "value")

Let’s look for the discription of our yield variables

i <- grep("Yield", md$var)
i
## [1] 48 49 50
md$var[i]
## [1] "Yield_A " "Yield_B"  "Yield_C"

Note the space trailing “Yield_A”? Additional whitespace (space, tab) like that can be difficult to spot and cause a lot trouble later on, so let’s remove all leading or trailing whitespace first. This is really very important for data processing, not so much for the meta-data, but it does not hurt to be careful.

md <- apply(md, 2, trimws)
# apply returns a matrix, let's keep a data.frame (always with stringsAsFactors=FALSE)
md <- data.frame(md, stringsAsFactors = FALSE)

Now inspect the values

md$value[i]
## [1] "Grain weight at spot A\nYear 2013 - 5X5 sq meter in Kg\nYear 2014 - 2X2.5 sq meter in Kg\nYear 2015 - 1X1 sq meter in gm\nYear 2016 - 2X2 sq meter in gm"
## [2] "Grain weight at spot B\nYear 2013 - 5X5 sq meter in Kg\nYear 2014 - 2X2.5 sq meter in Kg\nYear 2015 - 1X1 sq meter in gm\nYear 2016 - 2X2 sq meter in gm"
## [3] "Grain weight at spot C\nYear 2013 - 5X5 sq meter in Kg\nYear 2014 - 2X2.5 sq meter in Kg\nYear 2015 - 1X1 sq meter in gm\nYear 2016 - 2X2 sq meter in gm"

The variables are the same, except for the “spot A”, “spot B”, and “spot C” bits. You can check that by removing the first part of each value and then taking the unique value. Note that in the pattern provided to the gsub function below the . serves a “wild card”: it matches any character. Also note the presence of \n which represents a line break.

ymd <- gsub("Grain weight at spot .\n", "", md$value[i])
ymd <- unique(ymd)
ymd
## [1] "Year 2013 - 5X5 sq meter in Kg\nYear 2014 - 2X2.5 sq meter in Kg\nYear 2015 - 1X1 sq meter in gm\nYear 2016 - 2X2 sq meter in gm"

To show the effect of the line breaks, and to allow for easier reading, we can use cat

cat(ymd)
## Year 2013 - 5X5 sq meter in Kg
## Year 2014 - 2X2.5 sq meter in Kg
## Year 2015 - 1X1 sq meter in gm
## Year 2016 - 2X2 sq meter in gm

All right, so different years have different units. That is unexpected. In most cases all values of a variable (column) in a data set have the same unit. But, unfortunately, that is not always the case — and that can lead to a lot of errors (e.g. temperature in °F and °C, or elevation in meter and feet). Sometimes the unit is stored in another column, but in this case it is not, so we shall do that ourselves. Here we need to adjust for both the area harvested and the unit of mass used.

First make a table of the unit by year. In this case I would do that “by hand”:

units <- data.frame(Year=2013:2016, area=c(25, 2*2.5 ,1 , 4), tokg=c(1,1,1000,1000))
units
##   Year area tokg
## 1 2013   25    1
## 2 2014    5    1
## 3 2015    1 1000
## 4 2016    4 1000

The tokg variable indicates by how much the number needs to be divided to get kg (from g).

We already saw that in 2013 there was only one sample per field, but now we see that the plots were larger. So it appears that there was a change in the protocol from one larger sample to three smaller samples.

Compute yield per ha¶

We are now ready to adjust the yield data from the raw values to a single unit (we’ll use kg/ha). First we merge the data d with the new table units. When using merge you have to be careful that you do not loose records because of bad matching of values. So I check the number of records before and after merging.

dim(d)
## [1] 816  34
dm <- merge(d, units, by="Year")
dim(dm)
## [1] 816  36

The resulting data

head(dm[, c("Year", "yield", "area", "tokg")])
##   Year yield area tokg
## 1 2013 12.50   25    1
## 2 2013 18.00   25    1
## 3 2013 12.00   25    1
## 4 2013 14.08   25    1
## 5 2013 13.00   25    1
## 6 2013 11.50   25    1
tail(dm[, c("Year", "yield", "area", "tokg")])
##     Year    yield area tokg
## 811 2016 2278.333    4 1000
## 812 2016 2449.333    4 1000
## 813 2016 2473.333    4 1000
## 814 2016 2338.667    4 1000
## 815 2016 2245.333    4 1000
## 816 2016 2434.000    4 1000
table(dm[, c("Year", "area")])
##       area
## Year     1   4   5  25
##   2013   0   0   0 149
##   2014   0   0 250   0
##   2015 201   0   0   0
##   2016   0 216   0   0

Compute yield in kg/ha

dm$yha <- (10000 / dm$area) * (dm$yield / dm$tokg)

Let’s again check the numbers

tapply(dm$yha, dm$Year, mean, na.rm=TRUE)
##        2013        2014        2015        2016
## 3819.651007 5394.797333    5.751542 5708.807870

The yield for 2015 is much too low now whereas the other values are plausible rice yields in kg/ha. So I assume that the 2015 data were, in fact, in kg/ha, and not in g/ha.

dm$tokg[dm$Year==2015] <- 1
dm$yha <- (10000 / dm$area) * (dm$yield / dm$tokg)

Now the yield data look plausible

tapply(dm$yha, dm$Year, mean, na.rm=TRUE)
##     2013     2014     2015     2016
## 3819.651 5394.797 5751.542 5708.808

A final step is to correct for variation in moisture content. We can use the field Gr_Moisture but there are again missing values.

dm$Gr_Moisture[dm$Gr_Moisture == "-"] <- NA
dm$Gr_Moisture <- as.numeric(dm$Gr_Moisture)
table(dm$Year, is.na(dm$Gr_Moisture))
##
##        FALSE TRUE
##   2013   149    0
##   2014   156   94
##   2015   200    1
##   2016   216    0
boxplot(Gr_Moisture~Year, data=dm, col=rainbow(6), ylab="Yield (kg/ha)", cex.axis=1.25, cex.lab=1.25)

The table shows that there is 1 missing value in 2015, and 94 in 2014. The boxplot shows different moisture contents by years, so lets replace the missing values with the mean value for each year. Compute the mean values:

moisture <- aggregate(dm[, "Gr_Moisture", drop=FALSE], dm[, "Year", drop=FALSE], mean, na.rm=TRUE)
colnames(moisture)[2] <- "avg_moisture"
moisture
##   Year avg_moisture
## 1 2013     16.15705
## 2 2014     17.34359
## 3 2015     19.43600
## 4 2016     18.56995

Merge with the data

dim(dm)
## [1] 816  37
dm <- merge(dm, moisture, by="Year")
dim(dm)
## [1] 816  38

Find and replace the missing values

i <- is.na(dm$Gr_Moisture)
dm$Gr_Moisture[i] <- dm$avg_moisture[i]
table(is.na(dm$Gr_Moisture))
##
## FALSE
##   816
boxplot(Gr_Moisture~Year, data=dm, col=rainbow(6), ylab="Yield (kg/ha)", cex.axis=1.25, cex.lab=1.25)

The boxplot shows that the distribution has changed for 2014 — as there are now 97 values with the same (mean) value.

For rice, standard moisture content is 12%. We can compute it like this

dm$yha <- 12 * dm$yha / dm$Gr_Moisture
boxplot(yha~Year, data=dm, col=rainbow(6), ylab="Yield (kg/ha)", cex.axis=1.25, cex.lab=1.25)

Fertilizer¶

These are the fertilizer variables

fert <- c("DAP_Total", "MoP_Total", "Urea_Total", "Urea_1_Top", "MoP_1_Top", "Urea_2_Top", "MoP_2_Top", "Urea_3_Top", "MoP_3_Top")
md[md$var %in% fert, ]
##           var
## 30  DAP_Total
## 31  MoP_Total
## 32 Urea_Total
## 33 Urea_1_Top
## 34  MoP_1_Top
## 35 Urea_2_Top
## 36  MoP_2_Top
## 37 Urea_3_Top
## 38  MoP_3_Top
##                                                                                  value
## 30 Total amount of Di-ammonium Phosphate used in Kg per acre throughout the crop cycle
## 31      Total amount of Murate of Potash used in Kg per acre throughout the crop cycle
## 32                  Total amount of Urea used in Kg per acre throughout the crop cycle
## 33                Amount of Urea used in Kg per acre at the time of first top dressing
## 34    Amount of Murate of Potash used in Kg per acre at the time of first top dressing
## 35               Amount of Urea used in Kg per acre at the time of second top dressing
## 36   Amount of Murate of Potash used in Kg per acre at the time of second top dressing
## 37                Amount of Urea used in Kg per acre at the time of third top dressing
## 38    Amount of Murate of Potash used in Kg per acre at the time of third top dressing

A check to see if there are values. Get the max value for each variable by year.

aggregate(dm[, fert], dm[,"Year",drop=FALSE], max, na.rm=TRUE)
## Warning in FUN(X[[i]], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(X[[i]], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(X[[i]], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(X[[i]], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(X[[i]], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(X[[i]], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(X[[i]], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(X[[i]], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(X[[i]], ...): no non-missing arguments to max; returning -Inf
##   Year DAP_Total MoP_Total Urea_Total Urea_1_Top MoP_1_Top Urea_2_Top MoP_2_Top
## 1 2013        75      50.0          0         60         0         25         0
## 2 2014       100       0.0          0         40        25         50        30
## 3 2015       100      62.5         65          0         0          0         0
## 4 2016      -Inf      -Inf       -Inf       -Inf      -Inf       -Inf      -Inf
##   Urea_3_Top MoP_3_Top
## 1         40         0
## 2         50        30
## 3          0         0
## 4       -Inf      -Inf

Lots of warnings because there are no fertilizer data for 2016. Thus all values for that year are NA and you cannot get a max value in that case for a group that only has NA. You could use this function function(i) ifelse(all(is.na(i)), NA, max(i, na.rm=TRUE))) instead, to get rid of the warnings, but at the cost of tremendous verbosity.

All right, 2016 is out. Alose notice that in 2013 and 2014 the maximum value of Urea_Total is 0, whereas there Urea_Top variables show that this should not be the case. The opposite is the case for 2015: there is Urea_Total data, but no data for the split applications.

The situation is similar for MoP_Total, with no values in 2014. Let’s fix that.

i <- dm$Year %in% c(2013, 2014)
sum(i)
## [1] 399
dm$Urea_Total[i] <- rowSums(dm[i, c("Urea_1_Top", "Urea_2_Top", "Urea_3_Top")])
j <- dm$Year == 2014
sum(j)
## [1] 250
dm$MoP_Total[j] <- rowSums(dm[j, c("MoP_1_Top", "MoP_2_Top", "MoP_3_Top")])

Let’s check

df <- dm[dm$Year != 2016, ]
aggregate(df[, fert[1:3]], df[,"Year",drop=FALSE], mean, na.rm=TRUE)
##   Year DAP_Total MoP_Total Urea_Total
## 1 2013  32.59732  26.63793   36.83893
## 2 2014  25.76400  15.54800   32.51600
## 3 2015  36.30597  27.38065   35.73134