Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Overview

This document outlines the results obtained by running the pseudoqueries specified in the ODE Metadata Bug Impact, Resolution, and Prevention document.

S3 queries cannot be performed directly on the data, so they are proxied using the Metadata Query Function, a configurable python script.

WYDOT's production ODE instance was updated to the new version with the metadata bug fix (release 1.0.6) on April 11, 2019 at 6:30PM Mountain Time which corresponds to a timezone-naive UTC timestamp of April 12, 2019, 12:30AM. For the purposes of this document, dateOfBugFix is considered to be the end of the day on 04/11/2019 (midnight 4/12, in UTC time, i.e. 2019-04-12T00:00:00.000Z ISO format).

Note that these timestamps are inclusive; all data on December 3rd and April 11th is analyzed.

Timeline

Image Removed

High-Level Information

S3 Bucket:

usdot-its-cvpilot-public-data

Data Provider Source Analyzed:

wydot

Affected S3 File Prefixes:

["wydot/BSM/2018/12", "wydot/BSM/2019/01", "wydot/BSM/2019/02", "wydot/BSM/2019/03", "wydot/BSM/2019/04", "wydot/TIM/2018/12", "wydot/TIM/2019/01", "wydot/TIM/2019/02", "wydot/TIM/2019/03", "wydot/TIM/2019/04"]

S3 File Count Matching These Prefixes:

8,906

Approximate Data Size Matching These Prefixes:

6.85 GB

Findings

...

7

...

totalRecordCount = 3,601,067

...

SELECT MIN(metadata.recordGeneratedAt) WHERE 

...

Overview

This document outlines the results obtained by running the pseudoqueries specified in the ODE Metadata Bug Impact, Resolution, and Prevention document.

S3 queries cannot be performed directly on the data, so they are proxied using the Metadata Query Function, a configurable python script.

WYDOT's production ODE instance was updated to the new version with the metadata bug fix (release 1.0.6) on April 11, 2019 at 6:30PM Mountain Time which corresponds to a timezone-naive UTC timestamp of April 12, 2019, 12:30AM. For the purposes of this document, dateOfBugFix is considered to be the end of the day on 04/11/2019 (midnight 4/12, in UTC time, i.e. 2019-04-12T00:00:00.000Z ISO format).


Note that these timestamps are inclusive; all data on December 3rd and April 11th is analyzed.

Timeline

Image Added

High-Level Information

S3 Bucket:

usdot-its-cvpilot-public-data

Data Provider Source Analyzed:

wydot

Affected S3 File Prefixes:

["wydot/BSM/2018/12", "wydot/BSM/2019/01", "wydot/BSM/2019/02", "wydot/BSM/2019/03", "wydot/BSM/2019/04", "wydot/TIM/2018/12", "wydot/TIM/2019/01", "wydot/TIM/2019/02", "wydot/TIM/2019/03", "wydot/TIM/2019/04"]

S3 File Count Matching These Prefixes:

8,906

Approximate Data Size Matching These Prefixes:

6.85 GB

Findings

Query NumberQuery NamePseudoqueryResultNotes
1totalRecordCountSELECT COUNT(*) WHERE metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix3,601,06728,730 records found in these files DID NOT satisfy the query constraint.
2timBroadcastRecordCountSELECT COUNT(*) WHERE metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.recordGeneratedBy == TMC 95,0653,534,732 did not satisfy constraint
3goodOtherRecordCountSELECT COUNT(*) WHERE metadata.odeReceivedAt >= 2/13/2019 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload AND metadata.recordGeneratedBy != TMC03,629,797 did not satisfy constraint
4badBsmRecordCountSELECT COUNT(*) WHERE metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload3,506,002123,795 did not satisfy constraint
5badOtherRecordCountSELECT COUNT(*) WHERE metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload03,629,797 did not satisfy constraint
6removableRecordCountremovableRecordCount = badBsmRecordCount + badOtherRecordCount3,506,002 + 0 = 3,506,002

7

totalRecordCount (correctness)totalRecordCount equal to (==) (timBroadcastRecordCount  + removableRecordCount + goodOtherRecordCount)

totalRecordCount = 3,601,067

timBroadcastRecordCount  + removableRecordCount + goodOtherRecordCount = 95,065 + 3,506,002 + 0 = 3,601,067

3,601,067 ==  3,601,067

(Correct)
8earliestGeneratedAt

SELECT MIN(metadata.recordGeneratedAt) WHERE 

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload) OR

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload)

1970-01-01 00:00:00From null data.
9latestGeneratedAt

SELECT MAX(metadata.recordGeneratedAt) WHERE 

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload) OR

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload)

2019-04-11 15:45:58.766000
10 invalidRecordCountinvalidRecordCount = removableRecordCount3,506,002For clarification only.
11invalidS3FileCount

SELECT COUNT(s3-filename) WHERE 

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload) OR

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload)

1,488 files
12invalidS3Files

SELECT s3FileName WHERE 

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload) OR

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload)

invalid_s3_file_list_fixed.txt


13listOfLogFilesBefore

SELECT metadata.logFilename WHERE 

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload) OR

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload)

4,754 files


invalid_log_file_list.txt


14validRecordCountSELECT COUNT(*) WHERE metadata.odeReceivedAt >= 2/13/2019 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.recordGeneratedBy != TMC AND metadata.recordGeneratedAt >= earliestGeneratedAt AND metadata.recordGeneratedAt <= latestGeneratedAt [AND metadata.serialId.bundleSize > 1]2,012,3821,617,415 records did not satisfy query constraints
15validS3FileCountSELECT COUNT(s3-filename) WHERE metadata.odeReceivedAt >= 2/13/2019 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.recordGeneratedBy != TMC AND metadata.recordGeneratedAt >= earliestGeneratedAt AND metadata.recordGeneratedAt <= latestGeneratedAt [AND metadata.serialId.bundleSize > 1]920
16validS3FilesSELECT s3FileName WHERE metadata.odeReceivedAt >= 2/13/2019 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.recordGeneratedBy != TMC [AND metadata.serialId.bundleSize > 1]valid_s3_file_list_fixed.txt
17listOfLogFilesAfterSELECT metadata.logFilename WHERE metadata.odeReceivedAt >= 2/13/2019 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.recordGeneratedBy != TMC [AND metadata.serialId.bundleSize > 1]

3,087 files

valid_log_file_list.txt



Data Cleanup Actions


The data cleanup process has three main steps: identifying the bad data, deleting it, and confirming success. The above queries have accomplished the first step of identifying the bad data, in particular the list obtained by Query #12 - invalidS3Files. In this section we will discuss the process of safely and correctly deleting the bad data.

To prevent data loss from this process, we will activate S3 versioning for the public data bucket. This versioning feature allows us to easily revert any deletions should the need arise. In addition to activating versioning, the data will first be downloaded and the deletion process will be performed locally first. This "dry run" deletion will allow us to do a test run of the post-cleanup queries to verify that only the right files and records were deleted. 


Step 1: Identify scope of bad data

This has been completed by the queries above and in particular the list of files generated by query 12.

Step 2: Delete the bad data (dry run on local data)

For each file listed in the invalid_s3_file_list_fixed.txt above, the file in the local clone of the S3 bucket will be deleted.

Step 3: Verify cleanup (dry run on local data)

The queries in the Pseudo-queries for Validating the Clean Up section of the  ODE Metadata Bug Impact, Resolution, and Prevention document will be run locally and asserted that actual outputs match expected outputs.

If all of these queries pass, then we will move on to the next step of deleting the data in the real S3 bucket.

Step 4: Delete the bad data (real data on S3 Sandbox)

For each file listed in the invalid_s3_file_list_fixed.txt above, the file in the S3 bucket will be deleted.

Step 5: Verify cleanup (real data on S3 Sandbox)

The queries in the Pseudo-queries for Validating the Clean Up section of the  ODE Metadata Bug Impact, Resolution, and Prevention document will be run against the S3 bucket and asserted that actual outputs match expected outputs.

Post-Cleanup Query Findings

Query NumberQuery NamePseudoqueryResultExpected ResultNotes
1totalRecordCountSELECT COUNT(*) WHERE metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix
n/a
2timBroadcastRecordCountSELECT COUNT(*) WHERE metadata.odeReceivedAt >= 12/3/
2018 AND 
2018 AND metadata.
odeReceivedAt <= 2/12/2019 AND 
odeReceivedAt < dateOfBugFix AND metadata.recordGeneratedBy
!= TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload) OR(metadata
== TMC 
95,065 (same as before deletion)
3goodOtherRecordCountSELECT COUNT(*) WHERE metadata.odeReceivedAt >=
 12
 2/
3
13/
2018 AND metadata
2019 AND metadata.odeReceivedAt <
dateOfBugFix AND metadata
dateOfBugFix AND metadata.payloadType !=
=
us.dot.its.jpo.ode.model.OdeBsmPayload
)1970-01-01 00:00:00From null data.9latestGeneratedAt

SELECT MAX(metadata.recordGeneratedAt) WHERE 

(
AND metadata.recordGeneratedBy != TMC
n/a
4badBsmRecordCountSELECT COUNT(*) WHERE metadata.odeReceivedAt >=
 
12/3/
2018 
2018 AND
 
metadata.
odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType !
odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload
) OR(

0
5badOtherRecordCountSELECT COUNT(*) WHERE metadata.odeReceivedAt >=
 
12/3/2018
 AND 
AND metadata.odeReceivedAt <
dateOfBugFix AND metadata.payloadType =
= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload
)2019-04-11 15:45:58.76600010 invalidRecordCountinvalidRecordCount = removableRecordCount3,506,002For clarification only.11invalidS3FileCountSELECT COUNT(s3-filename) WHERE 

0
6removableRecordCountremovableRecordCount = badBsmRecordCount + badOtherRecordCount
0

7

totalRecordCount (correctness)totalRecordCount equal to (==) (timBroadcastRecordCount  + removableRecordCount + goodOtherRecordCount)

totalRecordCount = 

timBroadcastRecordCount  + removableRecordCount + goodOtherRecordCount 


(expected to be correct)
8earliestGeneratedAt

SELECT MIN(metadata.recordGeneratedAt) WHERE 

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload) OR

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model

.OdeBsmPayload)1,488 files12invalidS3FilesSELECT s3FileName WHERE SELECT metadata.logFilename 

.OdeBsmPayload)


1970-01-01 00:00:00

(same as before deletion)


9latestGeneratedAt

SELECT MAX(metadata.recordGeneratedAt) WHERE 

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload) OR

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload)

invalid_s3_file_list_fixed.txt

13listOfLogFilesBefore

2019-04-11 15:45:58.766000

(same as before deletion)


10 invalidRecordCountinvalidRecordCount = removableRecordCount
0
11invalidS3FileCount

SELECT COUNT(s3-filename) WHERE 

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model

.OdeBsmPayload) OR

.OdeBsmPayload) OR

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload)


0
12invalidS3Files

SELECT s3FileName WHERE 

(metadata.odeReceivedAt >= 12/3/

2018

2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.

odeReceivedAt < dateOfBugFix AND metadata

recordGeneratedBy != TMC AND metadata.payloadType

=

!= us.dot.its.jpo.ode.model.OdeBsmPayload)

4,754 files

invalid_log_file_list.txt

14validRecordCountSELECT COUNT(*) WHERE 

OR

(metadata.odeReceivedAt >=

 2

 12/

13

3/

2019

2018 AND metadata.odeReceivedAt < dateOfBugFix

 AND metadata.recordGeneratedBy != TMC AND metadata.recordGeneratedAt >= earliestGeneratedAt AND metadata.recordGeneratedAt <= latestGeneratedAt [AND metadata.serialId.bundleSize > 1]2,012,3821,617,415 records did not satisfy query constraints15validS3FileCountSELECT COUNT(s3-filename) WHERE 

 AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload)


(empty list)
13listOfLogFilesBefore

SELECT metadata.logFilename WHERE 

(metadata.odeReceivedAt >=

 2

 12/

13/2019 AND 

3/2018 AND metadata.

odeReceivedAt < dateOfBugFix AND

odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy !=

 TMC 

 TMC AND metadata.

recordGeneratedAt >= earliestGeneratedAt AND metadata.recordGeneratedAt <= latestGeneratedAt [AND metadata.serialId.bundleSize > 1]92016validS3FilesSELECT s3FileName WHERE 

payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload) OR

(metadata.odeReceivedAt >=

 2

 12/

13

3/

2019

2018 AND metadata.odeReceivedAt < dateOfBugFix

 AND metadata.recordGeneratedBy != TMC [AND metadata.serialId.bundleSize > 1]valid_s3_file_list_fixed.txt17listOfLogFilesAfterSELECT metadata.logFilename WHERE 

 AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload)


0

(empty list)

14validRecordCountSELECT COUNT(*) WHERE metadata.odeReceivedAt >= 2/13/2019 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.recordGeneratedBy !=
 TMC [AND metadata.serialId.bundleSize > 1]

3,087 files

valid_log_file_list.txt

Data Cleanup Actions

The data cleanup process has three main steps: identifying the bad data, deleting it, and confirming success. The above queries have accomplished the first step of identifying the bad data, in particular the list obtained by Query #12 - invalidS3Files. In this section we will discuss the process of safely and correctly deleting the bad data.

To prevent data loss from this process, we will activate S3 versioning for the public data bucket. This versioning feature allows us to easily revert any deletions should the need arise. In addition to activating versioning, the data will first be downloaded and the deletion process will be performed locally first. This "dry run" deletion will allow us to do a test run of the post-cleanup queries to verify that only the right files and records were deleted. 

Step 1: Identify scope of bad data

This has been completed by the queries above and in particular the list of files generated by query 12.

Step 2: Delete the bad data (dry run on local data)

For each file listed in the invalid_s3_file_list_fixed.txt above, the file in the local clone of the S3 bucket will be deleted.

Step 3: Verify cleanup (dry run on local data)

The queries in the Pseudo-queries for Validating the Clean Up section of the  ODE Metadata Bug Impact, Resolution, and Prevention document will be run locally and asserted that actual outputs match expected outputs.

If all of these queries pass, then we will move on to the next step of deleting the data in the real S3 bucket.

Step 4: Delete the bad data (real data on S3 Sandbox)

For each file listed in the invalid_s3_file_list_fixed.txt above, the file in the S3 bucket will be deleted.

Step 5: Verify cleanup (real data on S3 Sandbox)

...

 TMC AND metadata.recordGeneratedAt >= earliestGeneratedAt AND metadata.recordGeneratedAt <= latestGeneratedAt [AND metadata.serialId.bundleSize > 1]


15validS3FileCountSELECT COUNT(s3-filename) WHERE metadata.odeReceivedAt >= 2/13/2019 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.recordGeneratedBy != TMC AND metadata.recordGeneratedAt >= earliestGeneratedAt AND metadata.recordGeneratedAt <= latestGeneratedAt [AND metadata.serialId.bundleSize > 1]


16validS3FilesSELECT s3FileName WHERE metadata.odeReceivedAt >= 2/13/2019 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.recordGeneratedBy != TMC [AND metadata.serialId.bundleSize > 1]


17listOfLogFilesAfterSELECT metadata.logFilename WHERE metadata.odeReceivedAt >= 2/13/2019 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.recordGeneratedBy != TMC [AND metadata.serialId.bundleSize > 1]