The myrwaR
package includes a set of functions for retrieving data from the MyRWA Water Quality Access Database. First, we need to load the package:
library(myrwaR)
To connect to the database, use db_connect()
and specify the path to the database, which returns a connection handler that is later used to retrieve the data.
base_dir <- 'D:/Dropbox/Work/mystic/db'
db_path <- file.path(base_dir, "MysticDB_20160208.accdb")
ch <- db_connect(path = db_path)
The db_list_tables()
function lists the names of tables in the database (note that dplyr
has a function with the same name, so you may need to namespace it with myrwaR::
):
myrwaR::db_list_tables(ch)
#> [1] "20130622_VisitPrecip" "20130925_TableauAnnualMeans"
#> [3] "20140510_VisitPrecip" "ActionLimit"
#> [5] "Catchment_DB" "Characteristic"
#> [7] "CharacteristicNominal" "IMPORT_LOCATIONS"
#> [9] "IMPORT_NOMINAL" "IMPORT_NUMERIC"
#> [11] "IMPORT_STORM_VISIT" "IMPORT_VISIT"
#> [13] "Location" "LocationType"
#> [15] "Method" "Municipality"
#> [17] "Project" "Record_DataImport"
#> [19] "Reference" "Result"
#> [21] "ResultFlag" "ResultNominal"
#> [23] "SampleFraction" "SampleType"
#> [25] "SSO" "StormEvent"
#> [27] "StormEventVisit" "StormFlag"
#> [29] "StormMethod" "Units"
#> [31] "Visit" "WaterBody"
To load the data from any one of these tables, use db_table()
. For example, to get the Location
table:
locations <- db_table(ch, table_name = "Location")
str(locations)
#> 'data.frame': 2544 obs. of 16 variables:
#> $ ID : Factor w/ 2544 levels "1395MA","2ndRd",..: 1 2 3 4 5 6 7 8 9 10 ...
#> $ LocationDescription: Factor w/ 372 levels ""," DS from Mass Ave, where brook crosses road, near Wilson Farms",..: NA NA NA 1 NA NA 37 NA 166 1 ...
#> $ WaterBodyID : Factor w/ 77 levels "Aberjona River",..: 46 25 46 1 1 1 1 1 31 1 ...
#> $ MunicipalityID : Factor w/ 22 levels "Arlington","Belmont",..: 1 17 1 20 20 20 20 20 20 20 ...
#> $ Latitude : num 42.4 42.5 42.4 42.4 42.4 ...
#> $ Longitude : num -71.2 -71.1 -71.2 -71.1 -71.1 ...
#> $ LocationTypeID : int 22 22 22 22 22 22 22 22 20 22 ...
#> $ LocationMethod : Factor w/ 5 levels "GPS-Unspecified",..: 1 1 1 1 2 2 1 1 1 1 ...
#> $ CoordinateSystem : Factor w/ 2 levels "","WGS84": 2 2 2 2 2 2 2 2 2 2 ...
#> $ InWatershed : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ RegionalID : Factor w/ 2 levels "Lower Mystic River",..: 2 2 2 2 2 2 2 2 2 2 ...
#> $ WaterType : Factor w/ 2 levels "Fresh","Saline": 1 1 1 1 1 1 1 1 1 1 ...
#> $ Owner : Factor w/ 11 levels "","Arlington",..: NA NA NA 1 NA NA NA NA NA 1 ...
#> $ StaffGage : int 0 0 0 0 0 0 1 0 0 0 ...
#> $ County : Factor w/ 2 levels "Middlesex","Suffolk": 1 1 1 1 1 1 1 1 1 1 ...
#> $ CatchmentArea_Sqmi : num NA NA NA NA NA NA NA NA NA NA ...
If a database connection has already been created, water quality data can be retrieved using db_results()
. This function will automatically join the Visit
and Result
tables to provide the location IDs, datetimes, and useful attributes for each row in the Result
table. Note that this function will not join the Location
table, and thus not contain latitudes and longitudes. See the load_wq()
function in the next section.
results <- db_results(ch)
#> Warning: closing unused RODBC handle 8
Finally, to connect to the database and retrieve the water quality data with the location information, use the load_wq()
function (note that the name of this function differs from other database functions to maintain consistency with older R scripts used by MyRWA). The load_wq()
function will connect to the database, and then merge the Result
, Visit
, and Location
database tables into a single dataframe.
wq <- load_wq(path = db_path, sample_types = c("S"), exclude_flags = TRUE)
str(wq)
#> 'data.frame': 183279 obs. of 33 variables:
#> $ LocationID : Factor w/ 1013 levels "ABR001","ABR002",..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ VisitID : int 28280 26540 28280 28280 26540 28280 27527 26540 26540 26879 ...
#> $ ID : int 140905 133640 140906 140901 133641 140902 137608 133638 133642 135375 ...
#> $ CharacteristicID : Factor w/ 68 levels "ALK","ARSENIC",..: 17 16 43 48 15 15 15 19 48 17 ...
#> $ ResultValue : num 448 57.57 717 717 5.38 ...
#> $ Units : Factor w/ 13 levels "%","CFU/100ml",..: 2 1 10 13 6 6 6 2 13 2 ...
#> $ Qualifier : Factor w/ 3 levels "","<",">": NA NA NA NA NA NA NA NA NA NA ...
#> $ FlagID : Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
#> $ MethodID : int 3 24 69 71 24 24 24 17 71 2 ...
#> $ ResultComment : Factor w/ 48 levels ""," ","(bacteria) perservative (blue pill) fell out of one of bottles",..: NA NA NA NA NA NA NA NA NA NA ...
#> $ SampleDepth : num NA NA NA NA NA NA NA NA NA NA ...
#> $ UniqueID : Factor w/ 25751 levels "197703211400_WQ77_22_DEP_S",..: 16404 7329 16404 16404 7329 16404 11131 7329 7329 8887 ...
#> $ Datetime : POSIXct, format: "2009-08-26 06:59:00" "2003-06-24 05:18:00" ...
#> $ ProjectID : Factor w/ 9 levels "BASE","BHWQM",..: 5 5 5 5 5 5 5 5 5 5 ...
#> $ SampleTypeID : Factor w/ 1 level "S": 1 1 1 1 1 1 1 1 1 1 ...
#> $ HasFlow : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ SampleDepthType : Factor w/ 2 levels "B","S": 2 2 2 2 2 2 2 2 2 2 ...
#> $ Comment : Factor w/ 2086 levels ""," "," Tributary",..: NA NA NA NA NA NA NA NA NA 674 ...
#> $ LocationDescription: Factor w/ 302 levels ""," DS from Mass Ave, where brook crosses road, near Wilson Farms",..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ WaterBodyID : Factor w/ 63 levels "Aberjona River",..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ MunicipalityID : Factor w/ 22 levels "Arlington","Belmont",..: 20 20 20 20 20 20 20 20 20 20 ...
#> $ Latitude : num 42.4 42.4 42.4 42.4 42.4 ...
#> $ Longitude : num -71.1 -71.1 -71.1 -71.1 -71.1 ...
#> $ LocationTypeID : int 22 22 22 22 22 22 22 22 22 22 ...
#> $ LocationMethod : Factor w/ 5 levels "GPS-Unspecified",..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ CoordinateSystem : Factor w/ 2 levels "","WGS84": 2 2 2 2 2 2 2 2 2 2 ...
#> $ InWatershed : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ RegionalID : Factor w/ 2 levels "Lower Mystic River",..: 2 2 2 2 2 2 2 2 2 2 ...
#> $ WaterType : Factor w/ 2 levels "Fresh","Saline": 1 1 1 1 1 1 1 1 1 1 ...
#> $ Owner : Factor w/ 9 levels "","Arlington",..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ StaffGage : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ County : Factor w/ 2 levels "Middlesex","Suffolk": 1 1 1 1 1 1 1 1 1 1 ...
#> $ CatchmentArea_Sqmi : num NA NA NA NA NA NA NA NA NA NA ...
The sample_types
parameter lets you specify which sample types to include in the data frame. By default, it will only return routine samples ("S"
). To include other sample types such as field duplicates and field blanks, just add other SampleType IDs sample_types = c("S", "FD", "FB")
, or set sample_types = NULL
to retrieve all sample types. To see a complete list of sample types, load the SampleType
table: db_table(ch, table_name="SampleType")
.
The exclude_flags
parameter specifies whether the resulting data frame should exclude flagged samples (default is TRUE
). To include the flagged samples, set this to FALSE
.
To close a connection, simply use the close()
function:
close(ch)
The compare_database()
function summarizes changes between to database versions. This function loads the Result
, Visit
, and Location
tables from each database version, and then summarizes how many rows have been added or removed from the first to the second version. Note that this does not look at changes to the values of each row, it only checks for row additions and removals.
compare_database(old_path = file.path(base_dir, "MysticDB_20160120.accdb"),
new_path = file.path(base_dir, "MysticDB_20160208.accdb"))
#> MyRWA WQ Database Comparison
#> Old DB: MysticDB_20160120.accdb
#> New DB: MysticDB_20160208.accdb
#> ================================================================================
#> Result Table
#>
#> # Rows Added: 799
#> Summary of Added Results:
#> ID CharacteristicID Units
#> Min. :446011 SPCOND :100 mg/l :375
#> 1st Qu.:446211 DO : 98 MPN/100ml:100
#> Median :446410 TEMP_WATER: 96 uS/cm :100
#> Mean :446410 DO_SAT : 94 deg C : 96
#> 3rd Qu.:446610 NO23 : 77 % : 94
#> Max. :446809 TP : 77 PSU : 23
#> (Other) :257 (Other) : 11
#>
#> # Rows Removed: 46
#> Summary of Removed Results:
#> ID CharacteristicID Units
#> Min. :278223 DO : 6 mg/l :24
#> 1st Qu.:278234 ENT : 6 MPN/100ml: 6
#> Median :278246 NO23 : 6 uS/cm : 6
#> Mean :278246 SPCOND : 6 % : 5
#> 3rd Qu.:278257 TP : 6 deg C : 5
#> Max. :278268 TSS : 6 CFU/100ml: 0
#> (Other):10 (Other) : 0
#> ================================================================================
#> Visit Table
#>
#> # Rows Added: 102
#> ProjectIDs with Added Visits: BASE HOTSPOT
#> LocationIDs with Added Visits: ABR006 ABR028 ABR031 ABR036 ABR049 ALB006 ARL007 ARL010 ARL013 ARL014 ARL026 ARL027 ARL028 ARL136 BEI001 CHR95S MAR036 MEB001 MIB001 MIC004 MYR071 MYR275 MYRMMP SOMD05 SOMD07 SOMD08 SOMD09 UPL001 WIB001 WIN48WASH WINSWANGP WINWEDGE WINwse5 WINx03 WOB70-1
#> Summary of Added Visits:
#> ID Datetime LocationID ProjectID
#> Min. :137296 Min. :2015-01-23 06:20:00 MYR275 :14 BASE :77
#> 1st Qu.:137321 1st Qu.:2015-06-05 06:21:00 MYRMMP :14 HOTSPOT:25
#> Median :137347 Median :2015-10-02 07:03:30 BEI001 :13
#> Mean :137348 Mean :2015-08-26 10:04:27 CHR95S :13
#> 3rd Qu.:137378 3rd Qu.:2015-12-02 10:28:45 MIC004 :12
#> Max. :137403 Max. :2015-12-16 07:23:00 ABR006 : 2
#> (Other):34
#>
#> # Rows Removed: 6
#> ProjectIDs with Removed Visits: BASE
#> LocationIDs with Removed Visits: BEI093 CHR95S MIC004 MYR275 MYRMMP
#> Summary of Removed Visits:
#> ID Datetime LocationID ProjectID
#> Min. :43393 Min. :2015-01-23 06:20:00 BEI093:1 BASE:6
#> 1st Qu.:43394 1st Qu.:2015-01-23 06:20:00 CHR95S:1
#> Median :43396 Median :2015-01-23 06:24:00 MIC004:1
#> Mean :43396 Mean :2015-01-23 06:25:10 MYR275:2
#> 3rd Qu.:43397 3rd Qu.:2015-01-23 06:29:30 MYRMMP:1
#> Max. :43398 Max. :2015-01-23 06:33:00
#> ================================================================================
#> Location Table
#>
#> # Rows Added: 0
#>
#> # Rows Removed: 0
The output of the compare_database()
function can also be saves to a text file by setting the log_file
argument.
compare_database(old_path = file.path(base_dir, "MysticDB_20150227.accdb"),
new_path = file.path(base_dir, "MysticDB_20150529.accdb"),
log_file = file.path(base_dir, 'compare_20150227_20150529.txt'))
The db_qaqc_suite()
function runs a series of checks on the Result
and Visit
tables These checks are designed to find invalid values in each column of these tables. The types of checks include:
""
)"FD "
)NA
in columns that should not have missing valuesResult.Units->Units.ID
, Result.FlagID->ResultFlag.ID
, Visit.SampleTypeID->SampleType.ID
, and Visit.ProjectID->Project.ID
)Note that the first three checks are included because Microsoft Access does not enforce consistency across these issues and thus may go unnoticed within the database. However, these issues can cause problems when importing the results into R, which does require these consistencies.
For each set of checks, db_qaqc_suite()
function will report which columns failed, the reason, and the number of rows that failed each check. The specific rows that failed can be shown by setting print.values=TRUE
(not shown here for brevity).
ch <- db_connect(db_path)
db_qaqc_suite(ch, print.rows=FALSE)
#> Loading tables...done
#>
#> Checking Result table...
#> .. Checking for empty strings
#> .... CharacteristicID ...OK
#> .... Units ...OK
#> .... Qualifier ...FAILED
#>
#> ERROR: There are 600 row(s) with empty strings in column "Qualifier"
#>
#> .... FlagID ...OK
#> .. Checking for lowercase strings
#> .... CharacteristicID ...OK
#> .... FlagID ...FAILED
#>
#> ERROR: There are 2 row(s) with lowercase strings in column "FlagID"
#>
#> .. Checking for untrimmed strings
#> .... CharacteristicID ...OK
#> .... Units ...OK
#> .... Qualifier ...OK
#> .... FlagID ...FAILED
#>
#> ERROR: There are 34 row(s) with untrimmed strings in column "FlagID"
#>
#> .. Checking for invalid units
#> .... Units ...OK
#> .. Checking for invalid flags
#> .... FlagID ...FAILED
#>
#> ERROR: There are 36 row(s) with invalid flags in column "FlagID"
#>
#> Checking Visit table...
#> .. Checking for empty strings
#> .... UniqueID ...OK
#> .... LocationID ...OK
#> .... ProjectID ...OK
#> .... SampleTypeID ...OK
#> .... SampleDepthType ...OK
#> .. Checking for lowercase strings
#> .... ProjectID ...OK
#> .... SampleTypeID ...OK
#> .... SampleDepthType ...OK
#> .. Checking for untrimmed strings
#> .... UniqueID ...FAILED
#>
#> ERROR: There are 12 row(s) with untrimmed strings in column "UniqueID"
#>
#> .... LocationID ...OK
#> .... ProjectID ...OK
#> .... SampleTypeID ...FAILED
#>
#> ERROR: There are 12 row(s) with untrimmed strings in column "SampleTypeID"
#>
#> .... SampleDepthType ...OK
#> .. Checking for NA values
#> .... UniqueID ...OK
#> .... LocationID ...OK
#> .... ProjectID ...OK
#> .... SampleTypeID ...FAILED
#>
#> ERROR: There are 9 row(s) with NA values in column "SampleTypeID"
#>
#> .... SampleDepthType ...FAILED
#>
#> ERROR: There are 8774 row(s) with NA values in column "SampleDepthType"
#>
#> .. Checking for datetimes out-of-range
#> .... Datetime ...OK
#> .. Checking for invalid sample type IDs
#> .... SampleTypeID ...FAILED
#>
#> ERROR: There are 12 row(s) with invalid sample type IDs in column "SampleTypeID"
#>
#> .. Checking for invalid project IDs
#> .... ProjectID ...OK
#> Warning in db_qaqc_suite(ch, print.rows = FALSE): Database did not pass
#> validation checks, see details above
close(ch)