DataHub Metadata Bug Analysis and Recovery Report

Note that this document should not be confused with the DataHub Metadata Bug Analysis and Recovery Report for Re-uploaded Data that covered the May 2019 re-upload ODE crash. This document covers the metadata duplication and missing bsmSource problems that occurred from December 2018 to April 2019.

Overview

This document outlines the results obtained by running the pseudoqueries specified in the ODE Metadata Bug Impact, Resolution, and Prevention document.

S3 queries cannot be performed directly on the data, so they are proxied using the Metadata Query Function, a configurable python script.

WYDOT's production ODE instance was updated to the new version with the metadata bug fix (release 1.0.6) on April 11, 2019 at 6:30PM Mountain Time which corresponds to a timezone-naive UTC timestamp of April 12, 2019, 12:30AM. For the purposes of this document, dateOfBugFix is considered to be the end of the day on 04/11/2019 (midnight 4/12, in UTC time, i.e. 2019-04-12T00:00:00.000Z ISO format).


Note that these timestamps are inclusive; all data on December 3rd and April 11th is analyzed.

Timeline

High-Level Information

S3 Bucket:

usdot-its-cvpilot-public-data

Data Provider Source Analyzed:

wydot

Affected S3 File Prefixes:

["wydot/BSM/2018/12", "wydot/BSM/2019/01", "wydot/BSM/2019/02", "wydot/BSM/2019/03", "wydot/BSM/2019/04", "wydot/TIM/2018/12", "wydot/TIM/2019/01", "wydot/TIM/2019/02", "wydot/TIM/2019/03", "wydot/TIM/2019/04"]

S3 File Count Matching These Prefixes:

8,906

Approximate Data Size Matching These Prefixes:

6.85 GB

Pre-Cleanup Query Findings

Query NumberQuery NamePseudoqueryResultNotes
1totalRecordCountSELECT COUNT(*) WHERE metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix3,601,06728,730 records found in these files DID NOT satisfy the query constraint.
2timBroadcastRecordCountSELECT COUNT(*) WHERE metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.recordGeneratedBy == TMC 95,0653,534,732 did not satisfy constraint
3goodOtherRecordCountSELECT COUNT(*) WHERE metadata.odeReceivedAt >= 2/13/2019 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload AND metadata.recordGeneratedBy != TMC03,629,797 did not satisfy constraint
4badBsmRecordCountSELECT COUNT(*) WHERE metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload3,506,002123,795 did not satisfy constraint
5badOtherRecordCountSELECT COUNT(*) WHERE metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload03,629,797 did not satisfy constraint
6removableRecordCountremovableRecordCount = badBsmRecordCount + badOtherRecordCount3,506,002 + 0 = 3,506,002

7

totalRecordCount (correctness)totalRecordCount equal to (==) (timBroadcastRecordCount  + removableRecordCount + goodOtherRecordCount)

totalRecordCount = 3,601,067

timBroadcastRecordCount  + removableRecordCount + goodOtherRecordCount = 95,065 + 3,506,002 + 0 = 3,601,067

3,601,067 ==  3,601,067

(Correct)
8earliestGeneratedAt

SELECT MIN(metadata.recordGeneratedAt) WHERE 

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload) OR

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload)

1970-01-01 00:00:00From null data.
9latestGeneratedAt

SELECT MAX(metadata.recordGeneratedAt) WHERE 

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload) OR

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload)

2019-04-11 15:45:58.766000
10 invalidRecordCountinvalidRecordCount = removableRecordCount3,506,002For clarification only.
11invalidS3FileCount

SELECT COUNT(s3-filename) WHERE 

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload) OR

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload)

1,488 files
12invalidS3Files

SELECT s3FileName WHERE 

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload) OR

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload)

invalid_s3_file_list_fixed.txt


13listOfLogFilesBefore

SELECT metadata.logFilename WHERE 

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload) OR

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload)

4,754 files


invalid_log_file_list.txt


14validRecordCountSELECT COUNT(*) WHERE metadata.odeReceivedAt >= 2/13/2019 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.recordGeneratedBy != TMC AND metadata.recordGeneratedAt >= earliestGeneratedAt AND metadata.recordGeneratedAt <= latestGeneratedAt [AND metadata.serialId.bundleSize > 1]2,012,3821,617,415 records did not satisfy query constraints
15validS3FileCountSELECT COUNT(s3-filename) WHERE metadata.odeReceivedAt >= 2/13/2019 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.recordGeneratedBy != TMC AND metadata.recordGeneratedAt >= earliestGeneratedAt AND metadata.recordGeneratedAt <= latestGeneratedAt [AND metadata.serialId.bundleSize > 1]920
16validS3FilesSELECT s3FileName WHERE metadata.odeReceivedAt >= 2/13/2019 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.recordGeneratedBy != TMC [AND metadata.serialId.bundleSize > 1]valid_s3_file_list_fixed.txt
17listOfLogFilesAfterSELECT metadata.logFilename WHERE metadata.odeReceivedAt >= 2/13/2019 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.recordGeneratedBy != TMC [AND metadata.serialId.bundleSize > 1]

3,087 files

valid_log_file_list.txt


Data Cleanup Actions


The data cleanup process has three main steps: identifying the bad data, deleting it, and confirming success. The above queries have accomplished the first step of identifying the bad data, in particular the list obtained by Query #12 - invalidS3Files. In this section we will discuss the process of safely and correctly deleting the bad data.

To prevent data loss from this process, we will activate S3 versioning for the public data bucket. This versioning feature allows us to easily revert any deletions should the need arise. In addition to activating versioning, the data will first be downloaded and the deletion process will be performed locally first. This "dry run" deletion will allow us to do a test run of the post-cleanup queries to verify that only the right files and records were deleted. 


Step 1: Identify scope of bad data

This has been completed by the queries above and in particular the list of files generated by query 12.

Step 2: Delete the bad data (dry run on local data)

For each file listed in the invalid_s3_file_list_fixed.txt above, the file in the local clone of the S3 bucket will be deleted.

Step 3: Verify cleanup (dry run on local data)

The queries in the Pseudo-queries for Validating the Clean Up section of the  ODE Metadata Bug Impact, Resolution, and Prevention document will be run locally and asserted that actual outputs match expected outputs.

If all of these queries pass, then we will move on to the next step of deleting the data in the real S3 bucket.

Step 4: Delete the bad data (real data on S3 Sandbox)

For each file listed in the invalid_s3_file_list_fixed.txt above, the file in the S3 bucket will be deleted.

Step 5: Verify cleanup (real data on S3 Sandbox)

The queries in the Pseudo-queries for Validating the Clean Up section of the  ODE Metadata Bug Impact, Resolution, and Prevention document will be run against the S3 bucket and asserted that actual outputs match expected outputs.

Post-Cleanup Query Findings

Query NumberQuery NamePseudoqueryResultExpected ResultMatches Expected?Notes
1totalRecordCountSELECT COUNT(*) WHERE metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix95,065Number of tims in the time period (95,065)Yes

36,127 did not satisfy query constraints


131,192 total records in 

2timBroadcastRecordCountSELECT COUNT(*) WHERE metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.recordGeneratedBy == TMC 95,06595,065 (same as before deletion)Yes
3goodOtherRecordCountSELECT COUNT(*) WHERE metadata.odeReceivedAt >= 2/13/2019 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload AND metadata.recordGeneratedBy != TMC0n/aYes
4badBsmRecordCountSELECT COUNT(*) WHERE metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload00Yes
5badOtherRecordCountSELECT COUNT(*) WHERE metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload00Yes
6removableRecordCountremovableRecordCount = badBsmRecordCount + badOtherRecordCount00Yes

7

totalRecordCount (correctness)totalRecordCount equal to (==) (timBroadcastRecordCount  + removableRecordCount + goodOtherRecordCount)

totalRecordCount = 95,065

timBroadcastRecordCount  + removableRecordCount + goodOtherRecordCount = 95,065 + 0 + 0 = 95,065


(expected to be correct)Yes
8earliestGeneratedAt

SELECT MIN(metadata.recordGeneratedAt) WHERE 

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload) OR

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload)

(None)

No records should be found

YesThere are no records matching the query constraints, so there are no timestamps to be compared. This matches the expectation that after the deletion is complete, no records should be found.
9latestGeneratedAt

SELECT MAX(metadata.recordGeneratedAt) WHERE 

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload) OR

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload)

(None)

No records should be found

YesThere are no records matching the query constraints, so there are no timestamps to be compared. This matches the expectation that after the deletion is complete, no records should be found.
10 invalidRecordCountinvalidRecordCount = removableRecordCount00Yes
11invalidS3FileCount

SELECT COUNT(s3-filename) WHERE 

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload) OR

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload)

0 - empty list0Yes
12invalidS3Files

SELECT s3FileName WHERE 

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload) OR

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload)

0 - empty list

(empty list)Yes
13listOfLogFilesBefore

SELECT metadata.logFilename WHERE 

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload) OR

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload)

0 - empty list

0

(empty list)
Yes
14validRecordCountSELECT COUNT(*) WHERE metadata.odeReceivedAt >= 2/13/2019 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.recordGeneratedBy != TMC AND metadata.recordGeneratedAt >= earliestGeneratedAt AND metadata.recordGeneratedAt <= latestGeneratedAt [AND metadata.serialId.bundleSize > 1]00Yes
15validS3FileCountSELECT COUNT(s3-filename) WHERE metadata.odeReceivedAt >= 2/13/2019 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.recordGeneratedBy != TMC AND metadata.recordGeneratedAt >= earliestGeneratedAt AND metadata.recordGeneratedAt <= latestGeneratedAt [AND metadata.serialId.bundleSize > 1]0 - empty list0 - empty listYes
16validS3FilesSELECT s3FileName WHERE metadata.odeReceivedAt >= 2/13/2019 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.recordGeneratedBy != TMC [AND metadata.serialId.bundleSize > 1]0 - empty list0 - empty listYes
17listOfLogFilesAfterSELECT metadata.logFilename WHERE metadata.odeReceivedAt >= 2/13/2019 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.recordGeneratedBy != TMC [AND metadata.serialId.bundleSize > 1]

0 - empty list

0 - empty listYes