This document describes functions in the myrwaR
package for handling STORET data. First, we’ll walk through how to fetch MyRWA data that is currently in STORET. Then we will compare the STORET dataset to what is currently in the Access database.
library(myrwaR)
library(lubridate)
library(dplyr)
library(tidyr)
The MyRWA locations currently in STORET can be fetched using the get_storet_locations()
function. This function will download a zip file from STORET, unzip the contents, and load the resulting station.csv
into a data frame. Note that verbose
is optional and TRUE
by default, but you can set it to FALSE
to suppress the log output.
storet_locations <- get_storet_locations(verbose = TRUE)
#> Fetching locations table from STORET...OK
#> STORET locations zip file saved to: C:\Users\Jeff\AppData\Local\Temp\RtmpOajdqC\storet_stn_22f4ca2756e.zip
#> STORET locations csv file saved to: C:/Users/Jeff/AppData/Local/Temp/RtmpOajdqC/station.csv
str(storet_locations)
#> 'data.frame': 15 obs. of 36 variables:
#> $ OrganizationIdentifier : chr "MYRWA" "MYRWA" "MYRWA" "MYRWA" ...
#> $ OrganizationFormalName : chr "Mystic River Watershed Association (Massachusetts)" "Mystic River Watershed Association (Massachusetts)" "Mystic River Watershed Association (Massachusetts)" "Mystic River Watershed Association (Massachusetts)" ...
#> $ MonitoringLocationIdentifier : chr "MYRWA-ABR006" "MYRWA-ABR028" "MYRWA-ABR049" "MYRWA-ALB006" ...
#> $ MonitoringLocationName : chr "Aberjona River at USGS Gaging Station in Winchester; the bank upstream of weir" "Aberjona River at Washington Street in Winchester; upstream side of the bridge" "Aberjona River at Salem Street in Woburn; downstream side of the bridge" "Alewife Brook at Broadway Bridge in Somerville; downstream of the bridge on the bank" ...
#> $ MonitoringLocationTypeName : chr "River/Stream" "River/Stream" "River/Stream" "River/Stream" ...
#> $ MonitoringLocationDescriptionText : logi NA NA NA NA NA NA ...
#> $ HUCEightDigitCode : int 1090001 1090001 1090001 1090001 NA NA NA 1090001 1090001 1090001 ...
#> $ DrainageAreaMeasure.MeasureValue : logi NA NA NA NA NA NA ...
#> $ DrainageAreaMeasure.MeasureUnitCode : logi NA NA NA NA NA NA ...
#> $ ContributingDrainageAreaMeasure.MeasureValue : logi NA NA NA NA NA NA ...
#> $ ContributingDrainageAreaMeasure.MeasureUnitCode: logi NA NA NA NA NA NA ...
#> $ LatitudeMeasure : num 42.4 42.5 42.5 42.4 42.4 ...
#> $ LongitudeMeasure : num -71.1 -71.1 -71.1 -71.1 -71 ...
#> $ SourceMapScaleNumeric : logi NA NA NA NA NA NA ...
#> $ HorizontalAccuracyMeasure.MeasureValue : logi NA NA NA NA NA NA ...
#> $ HorizontalAccuracyMeasure.MeasureUnitCode : logi NA NA NA NA NA NA ...
#> $ HorizontalCollectionMethodName : chr "GPS-Unspecified" "GPS-Unspecified" "GPS-Unspecified" "GPS-Unspecified" ...
#> $ HorizontalCoordinateReferenceSystemDatumName : chr "WGS84" "WGS84" "WGS84" "WGS84" ...
#> $ VerticalMeasure.MeasureValue : logi NA NA NA NA NA NA ...
#> $ VerticalMeasure.MeasureUnitCode : logi NA NA NA NA NA NA ...
#> $ VerticalAccuracyMeasure.MeasureValue : logi NA NA NA NA NA NA ...
#> $ VerticalAccuracyMeasure.MeasureUnitCode : logi NA NA NA NA NA NA ...
#> $ VerticalCollectionMethodName : logi NA NA NA NA NA NA ...
#> $ VerticalCoordinateReferenceSystemDatumName : logi NA NA NA NA NA NA ...
#> $ CountryCode : chr "US" "US" "US" "US" ...
#> $ StateCode : int 25 25 25 25 25 25 25 25 25 25 ...
#> $ CountyCode : int 17 17 17 17 25 25 17 17 17 25 ...
#> $ AquiferName : logi NA NA NA NA NA NA ...
#> $ FormationTypeText : logi NA NA NA NA NA NA ...
#> $ AquiferTypeName : logi NA NA NA NA NA NA ...
#> $ ConstructionDateText : logi NA NA NA NA NA NA ...
#> $ WellDepthMeasure.MeasureValue : logi NA NA NA NA NA NA ...
#> $ WellDepthMeasure.MeasureUnitCode : logi NA NA NA NA NA NA ...
#> $ WellHoleDepthMeasure.MeasureValue : logi NA NA NA NA NA NA ...
#> $ WellHoleDepthMeasure.MeasureUnitCode : logi NA NA NA NA NA NA ...
#> $ ProviderName : chr "STORET" "STORET" "STORET" "STORET" ...
The MyRWA results table currently in STORET can be fetched using the get_storet_results()
function. This function will download a zip file from STORET, unzip the contents, and load the resulting results.csv
into a data frame. Note that the verbose
argument can be used here too.
storet_results <- get_storet_results()
#> Fetching results table from STORET...OK
#> STORET results zip file saved to: C:\Users\Jeff\AppData\Local\Temp\RtmpOajdqC\storet_results_22f439813589.zip
#> STORET results csv file saved to: C:/Users/Jeff/AppData/Local/Temp/RtmpOajdqC/result.csv
str(storet_results)
#> 'data.frame': 20686 obs. of 63 variables:
#> $ OrganizationIdentifier : chr "MYRWA" "MYRWA" "MYRWA" "MYRWA" ...
#> $ OrganizationFormalName : chr "Mystic River Watershed Association (Massachusetts)" "Mystic River Watershed Association (Massachusetts)" "Mystic River Watershed Association (Massachusetts)" "Mystic River Watershed Association (Massachusetts)" ...
#> $ ActivityIdentifier : chr "MYRWA-BEI093:201112160721:L:S" "MYRWA-MYRMMP:201102090712:F:S" "MYRWA-CHR95S:201101250736:F:S" "MYRWA-MYR275:201110170706:L:S" ...
#> $ ActivityTypeCode : chr "Sample-Routine" "Field Msr/Obs" "Field Msr/Obs" "Sample-Routine" ...
#> $ ActivityMediaName : chr "Water" "Water" "Water" "Water" ...
#> $ ActivityMediaSubdivisionName : logi NA NA NA NA NA NA ...
#> $ ActivityStartDate : Date, format: "2011-12-16" "2011-02-09" ...
#> $ ActivityStartTime.Time : chr "07:21:00" "07:12:00" "07:36:00" "06:06:00" ...
#> $ ActivityStartTime.TimeZoneCode : chr "EST" "EST" "EST" "EST" ...
#> $ ActivityEndDate : logi NA NA NA NA NA NA ...
#> $ ActivityEndTime.Time : logi NA NA NA NA NA NA ...
#> $ ActivityEndTime.TimeZoneCode : logi NA NA NA NA NA NA ...
#> $ ActivityDepthHeightMeasure.MeasureValue : logi NA NA NA NA NA NA ...
#> $ ActivityDepthHeightMeasure.MeasureUnitCode : logi NA NA NA NA NA NA ...
#> $ ActivityDepthAltitudeReferencePointText : logi NA NA NA NA NA NA ...
#> $ ActivityTopDepthHeightMeasure.MeasureValue : logi NA NA NA NA NA NA ...
#> $ ActivityTopDepthHeightMeasure.MeasureUnitCode : logi NA NA NA NA NA NA ...
#> $ ActivityBottomDepthHeightMeasure.MeasureValue : logi NA NA NA NA NA NA ...
#> $ ActivityBottomDepthHeightMeasure.MeasureUnitCode : logi NA NA NA NA NA NA ...
#> $ ProjectIdentifier : chr "BASE" "BASE" "BASE" "BASE" ...
#> $ ActivityConductingOrganizationText : logi NA NA NA NA NA NA ...
#> $ MonitoringLocationIdentifier : chr "MYRWA-BEI093" "MYRWA-MYRMMP" "MYRWA-CHR95S" "MYRWA-MYR275" ...
#> $ ActivityCommentText : logi NA NA NA NA NA NA ...
#> $ SampleAquifer : logi NA NA NA NA NA NA ...
#> $ HydrologicCondition : logi NA NA NA NA NA NA ...
#> $ HydrologicEvent : logi NA NA NA NA NA NA ...
#> $ SampleCollectionMethod.MethodIdentifier : chr "grab" "USEPA" "USEPA" "grab" ...
#> $ SampleCollectionMethod.MethodIdentifierContext : chr "MYRWA" "USEPA" "USEPA" "MYRWA" ...
#> $ SampleCollectionMethod.MethodName : chr "Grab Sample" "USEPA" "USEPA" "Grab Sample" ...
#> $ SampleCollectionEquipmentName : chr "Water Bottle" "Unknown" "Unknown" "Water Bottle" ...
#> $ ResultDetectionConditionText : chr "" "" "" "" ...
#> $ CharacteristicName : chr "Dissolved oxygen (DO)" "Temperature, water" "Temperature, water" "Dissolved oxygen (DO)" ...
#> $ ResultSampleFractionText : chr "" "" "" "" ...
#> $ ResultMeasureValue : num 62.7 -1.5 -1 6.4 64.3 ...
#> $ ResultMeasure.MeasureUnitCode : chr "%" "deg C" "deg C" "mg/l" ...
#> $ MeasureQualifierCode : chr "" "" "" "" ...
#> $ ResultStatusIdentifier : chr "Final" "Final" "Final" "Final" ...
#> $ StatisticalBaseCode : logi NA NA NA NA NA NA ...
#> $ ResultValueTypeName : chr "Calculated" "Actual" "Actual" "Actual" ...
#> $ ResultWeightBasisText : logi NA NA NA NA NA NA ...
#> $ ResultTimeBasisText : logi NA NA NA NA NA NA ...
#> $ ResultTemperatureBasisText : logi NA NA NA NA NA NA ...
#> $ ResultParticleSizeBasisText : logi NA NA NA NA NA NA ...
#> $ PrecisionValue : logi NA NA NA NA NA NA ...
#> $ ResultCommentText : chr "" "" "" "" ...
#> $ USGSPCode : logi NA NA NA NA NA NA ...
#> $ ResultDepthHeightMeasure.MeasureValue : logi NA NA NA NA NA NA ...
#> $ ResultDepthHeightMeasure.MeasureUnitCode : logi NA NA NA NA NA NA ...
#> $ ResultDepthAltitudeReferencePointText : logi NA NA NA NA NA NA ...
#> $ SubjectTaxonomicName : logi NA NA NA NA NA NA ...
#> $ SampleTissueAnatomyName : logi NA NA NA NA NA NA ...
#> $ ResultAnalyticalMethod.MethodIdentifier : chr "360.2" "170.1" "170.1" "360.2" ...
#> $ ResultAnalyticalMethod.MethodIdentifierContext : chr "USEPA" "USEPA" "USEPA" "USEPA" ...
#> $ ResultAnalyticalMethod.MethodName : chr "Dissolved Oxygen by Winkler Technique" "Temperature" "Temperature" "Dissolved Oxygen by Winkler Technique" ...
#> $ MethodDescriptionText : chr "https://www.nemi.gov/methods/method_summary/5253/" "https://www.nemi.gov/methods/method_summary/5216/" "https://www.nemi.gov/methods/method_summary/5216/" "https://www.nemi.gov/methods/method_summary/5253/" ...
#> $ LaboratoryName : logi NA NA NA NA NA NA ...
#> $ AnalysisStartDate : logi NA NA NA NA NA NA ...
#> $ ResultLaboratoryCommentText : logi NA NA NA NA NA NA ...
#> $ DetectionQuantitationLimitTypeName : logi NA NA NA NA NA NA ...
#> $ DetectionQuantitationLimitMeasure.MeasureValue : logi NA NA NA NA NA NA ...
#> $ DetectionQuantitationLimitMeasure.MeasureUnitCode: logi NA NA NA NA NA NA ...
#> $ PreparationStartDate : logi NA NA NA NA NA NA ...
#> $ ProviderName : chr "STORET" "STORET" "STORET" "STORET" ...
This section compares the number of samples by different variables (e.g. location, characteristic, year) between STORET and the current water quality database.
First, we need to extract the dataset from the current database using the function wqx_results()
(this will convert the dataset to WQX format, which is similar, though not identical, to the format provided by STORET):
ch <- db_connect("D:/Dropbox/Work/mystic/db/MysticDB_20160208.accdb")
db_results <- wqx_results(ch, projects='BASE')
close(ch)
Next, we’ll convert the ActivityStartDate
column in the db_results
data frame to Date
types so we can extract the Year later, and also prepend "MYRWA-"
to the MonitoringLocationID
and ActivityID
columns since the data returned from STORET includes this prefix but the WQX export format does not.
db_results <- mutate(db_results,
ActivityStartDate = as.Date(ActivityStartDate),
MonitoringLocationID = paste("MYRWA", MonitoringLocationID, sep = "-"),
ActivityID = paste("MYRWA", ActivityID, sep = "-"))
To compare the STORET dataset to that in the current database, we can use the storet_compare_count()
function. This function takes two arguments, one for each dataset. Each argument should be a named list containing one or more columns from the original results table. The names in each list let you rename the columns from the original dataset (e.g. LocationID = storet_results$MonitoringLocationIdentifier)
) and also define names for new columns (e.g. Year = year(storet_results$ActivityStartDate)
). The names in the two list arguments (storet
and db
) must be identical and cannot be "Source"
or "n"
, which are created internally within the function. For each list, the function will count the number of samples for each unique combination of values, and then compares those counts between the two datasets to identify any differences.
First, let’s compare the number of samples by Location ID. Note how the names of the original columns are different in storet_results
and db_results
, and thus are renamed to both be LocationID
within the list arguments.
storet_compare_count(storet = list(LocationID = storet_results$MonitoringLocationIdentifier),
db = list(LocationID = db_results$MonitoringLocationID))
#> Sample counts by ( LocationID ) are different:
#>
#> LocationID DB STORET STORET-DB
#> MYRWA-ABR006 1798 1728 -70
#> MYRWA-ABR028 1828 1758 -70
#> MYRWA-ABR049 1756 1697 -59
#> MYRWA-ALB006 1883 1819 -64
#> MYRWA-BEI093 607 615 8
#> MYRWA-CHR95S 796 706 -90
#> MYRWA-MAR036 1754 1693 -61
#> MYRWA-MEB001 1795 1725 -70
#> MYRWA-MIB001 1809 1741 -68
#> MYRWA-MIC004 790 702 -88
#> MYRWA-MYR071 1801 1733 -68
#> MYRWA-MYR275 732 636 -96
#> MYRWA-MYRMMP 829 727 -102
#> MYRWA-UPL001 1783 1703 -80
#> MYRWA-WIB001 1783 1703 -80
#> MYRWA-BEI001 102 0 -102
Next, let’s compare the number of samples by CharacteristicName
:
storet_compare_count(storet = list(CharacteristicName = storet_results$CharacteristicName),
db = list(CharacteristicName = db_results$CharacteristicName))
#> Sample counts by ( CharacteristicName ) are different:
#>
#> CharacteristicName DB STORET STORET-DB
#> Ammonia-nitrogen 542 472 -70
#> Dissolved oxygen (DO) 4879 4635 -244
#> Enterococcus 416 356 -60
#> Escherichia coli 1800 1729 -71
#> Inorganic nitrogen (nitrate and nitrite) 2505 2375 -130
#> Phosphorus 2609 2479 -130
#> Specific conductance 2550 2420 -130
#> Temperature, water 2500 2374 -126
#> Total suspended solids 2564 2434 -130
#> Turbidity 187 118 -69
And then, we’ll compare counts by Year
. Note how the columns are converted from dates to years in each list using the year()
function.
storet_compare_count(storet = list(Year = year(storet_results$ActivityStartDate)),
db = list(Year = year(db_results$ActivityStartDate)))
#> Sample counts by ( Year ) are different:
#>
#> Year DB STORET STORET-DB
#> 2015 1794 634 -1160
Finally, we’ll compare counts by ActivityStartDate
:
storet_compare_count(storet = list(Date = storet_results$ActivityStartDate),
db = list(Date = db_results$ActivityStartDate))
#> Sample counts by ( Date ) are different:
#>
#> Date DB STORET STORET-DB
#> 2015-03-10 46 0 -46
#> 2015-04-08 46 0 -46
#> 2015-05-08 48 0 -48
#> 2015-06-05 48 0 -48
#> 2015-07-06 46 0 -46
#> 2015-07-15 116 0 -116
#> 2015-08-04 48 0 -48
#> 2015-08-19 110 0 -110
#> 2015-09-02 48 0 -48
#> 2015-09-16 120 0 -120
#> 2015-10-02 48 0 -48
#> 2015-10-21 120 0 -120
#> 2015-11-17 46 0 -46
#> 2015-11-18 114 0 -114
#> 2015-12-16 156 0 -156
These comparisons show that there are a number of sample events from 2015 that are in the database but not yet in STORET. This approach can thus be useful for figuring out which sample events should be exported for the next STORET upload. It’s also important to recognize that the differences shown in these individual tables do not reflect individual discrepancies, but rather that there are a handful of sampling events that have not yet been uploaded to STORET. In other words, the differences in sample counts for individual locations or characteristics simply reflect that some of the visits have not been uploaded yet.
To verify that the counts by location and characteristic match among only the visits that have been uploaded to STORET, we can simply filter the database dataset to only include sample dates that are also in the STORET dataset.
db_results <- filter(db_results, ActivityStartDate %in% unique(storet_results$ActivityStartDate))
storet_compare_count(storet = list(LocationID = storet_results$MonitoringLocationIdentifier),
db = list(LocationID = db_results$MonitoringLocationID))
#> Sample counts by ( LocationID ) are different:
#>
#> LocationID DB STORET STORET-DB
#> MYRWA-BEI093 607 615 8
#> MYRWA-BEI001 8 0 -8
storet_compare_count(storet = list(CharacteristicName = storet_results$CharacteristicName),
db = list(CharacteristicName = db_results$CharacteristicName))
#> Sample counts by ( CharacteristicName ) are the same
storet_compare_count(storet = list(Year = year(storet_results$ActivityStartDate)),
db = list(Year = year(db_results$ActivityStartDate)))
#> Sample counts by ( Year ) are the same
storet_compare_count(storet = list(Date = storet_results$ActivityStartDate),
db = list(Date = db_results$ActivityStartDate))
#> Sample counts by ( Date ) are the same
Based on these comparisons using only the dates that are in STORET, we see one issue in the counts by location ID. There were 8 samples where the LocationID appears to have been changed in the database from BEI001 to BEI093. We can then identify the specific sample date associated with this difference by doing a comparison on both Date
and LocationID
.
storet_compare_count(storet = list(Date = storet_results$ActivityStartDate,
LocationID = storet_results$MonitoringLocationIdentifier),
db = list(Date = db_results$ActivityStartDate,
LocationID = db_results$MonitoringLocationID))
#> Sample counts by ( Date, LocationID ) are different:
#>
#> Date LocationID DB STORET STORET-DB
#> 2015-01-23 MYRWA-BEI093 0 8 8
#> 2015-01-23 MYRWA-BEI001 8 0 -8
And we see that the difference in LocationID
occurs in the 2015-01-23
sample visit.
Another thing we can do is look for differences in the ActivityID
, which shows the same result. Note that the lab (L
) and field (F
) samples each have different ActivityIDs
, so there are four rows total, two for each dataset.
storet_compare_count(storet = list(ActivityID = storet_results$ActivityIdentifier),
db = list(ActivityID = db_results$ActivityID))
#> Sample counts by ( ActivityID ) are different:
#>
#> ActivityID DB STORET STORET-DB
#> MYRWA-BEI093:201501230630:F:S 0 1 1
#> MYRWA-BEI093:201501230630:L:S 0 7 7
#> MYRWA-BEI001:201501230630:F:S 1 0 -1
#> MYRWA-BEI001:201501230630:L:S 7 0 -7
We now have some useful information to identify and correct the issue. Here are the corresponding rows from the STORET and Database result tables.
filter(storet_results,
MonitoringLocationIdentifier == "MYRWA-BEI093",
ActivityStartDate == as.Date("2015-01-23")) %>%
select(ActivityIdentifier, ActivityStartDate, ActivityStartTime.Time,
MonitoringLocationIdentifier, CharacteristicName, ResultMeasureValue,
ResultMeasure.MeasureUnitCode)
#> ActivityIdentifier ActivityStartDate ActivityStartTime.Time
#> 1 MYRWA-BEI093:201501230630:L:S 2015-01-23 06:30:00
#> 2 MYRWA-BEI093:201501230630:L:S 2015-01-23 06:30:00
#> 3 MYRWA-BEI093:201501230630:L:S 2015-01-23 06:30:00
#> 4 MYRWA-BEI093:201501230630:L:S 2015-01-23 06:30:00
#> 5 MYRWA-BEI093:201501230630:L:S 2015-01-23 06:30:00
#> 6 MYRWA-BEI093:201501230630:L:S 2015-01-23 06:30:00
#> 7 MYRWA-BEI093:201501230630:L:S 2015-01-23 06:30:00
#> 8 MYRWA-BEI093:201501230630:F:S 2015-01-23 06:30:00
#> MonitoringLocationIdentifier CharacteristicName
#> 1 MYRWA-BEI093 Phosphorus
#> 2 MYRWA-BEI093 Specific conductance
#> 3 MYRWA-BEI093 Enterococcus
#> 4 MYRWA-BEI093 Dissolved oxygen (DO)
#> 5 MYRWA-BEI093 Inorganic nitrogen (nitrate and nitrite)
#> 6 MYRWA-BEI093 Dissolved oxygen (DO)
#> 7 MYRWA-BEI093 Total suspended solids
#> 8 MYRWA-BEI093 Temperature, water
#> ResultMeasureValue ResultMeasure.MeasureUnitCode
#> 1 0.053 mg/l
#> 2 45000.000 uS/cm
#> 3 NA
#> 4 10.000 mg/l
#> 5 NA
#> 6 68.512 %
#> 7 32.000 mg/l
#> 8 -0.500 deg C
filter(db_results,
MonitoringLocationID == "MYRWA-BEI001",
ActivityStartDate == as.Date("2015-01-23")) %>%
select(ActivityID, ActivityStartDate, ActivityStartTime, MonitoringLocationID,
CharacteristicName, ResultValue, ResultUnit)
#> ActivityID ActivityStartDate ActivityStartTime
#> 1 MYRWA-BEI001:201501230630:L:S 2015-01-23 06:30:00
#> 2 MYRWA-BEI001:201501230630:L:S 2015-01-23 06:30:00
#> 3 MYRWA-BEI001:201501230630:L:S 2015-01-23 06:30:00
#> 4 MYRWA-BEI001:201501230630:L:S 2015-01-23 06:30:00
#> 5 MYRWA-BEI001:201501230630:L:S 2015-01-23 06:30:00
#> 6 MYRWA-BEI001:201501230630:L:S 2015-01-23 06:30:00
#> 7 MYRWA-BEI001:201501230630:F:S 2015-01-23 06:30:00
#> 8 MYRWA-BEI001:201501230630:L:S 2015-01-23 06:30:00
#> MonitoringLocationID CharacteristicName
#> 1 MYRWA-BEI001 Phosphorus
#> 2 MYRWA-BEI001 Specific conductance
#> 3 MYRWA-BEI001 Inorganic nitrogen (nitrate and nitrite)
#> 4 MYRWA-BEI001 Dissolved oxygen (DO)
#> 5 MYRWA-BEI001 Enterococcus
#> 6 MYRWA-BEI001 Total suspended solids
#> 7 MYRWA-BEI001 Temperature, water
#> 8 MYRWA-BEI001 Dissolved oxygen (DO)
#> ResultValue ResultUnit
#> 1 0.053 mg/l
#> 2 45000.000 uS/cm
#> 3 NA mg/l
#> 4 68.512 %
#> 5 NA MPN/100ml
#> 6 32.000 mg/l
#> 7 -0.500 deg C
#> 8 10.000 mg/l