DataHub Metadata Bug Analysis and Recovery Report
Note that this document should not be confused with the DataHub Metadata Bug Analysis and Recovery Report for Re-uploaded Data that covered the May 2019 re-upload ODE crash. This document covers the metadata duplication and missing bsmSource problems that occurred from December 2018 to April 2019.
Overview
This document outlines the results obtained by running the pseudoqueries specified in the ODE Metadata Bug Impact, Resolution, and Prevention document.
S3 queries cannot be performed directly on the data, so they are proxied using the Metadata Query Function, a configurable python script.
WYDOT's production ODE instance was updated to the new version with the metadata bug fix (release 1.0.6) on April 11, 2019 at 6:30PM Mountain Time which corresponds to a timezone-naive UTC timestamp of April 12, 2019, 12:30AM. For the purposes of this document, dateOfBugFix is considered to be the end of the day on 04/11/2019 (midnight 4/12, in UTC time, i.e. 2019-04-12T00:00:00.000Z ISO format).
Note that these timestamps are inclusive; all data on December 3rd and April 11th is analyzed.
Timeline
High-Level Information
S3 Bucket:
usdot-its-cvpilot-public-data
Data Provider Source Analyzed:
wydot
Affected S3 File Prefixes:
["wydot/BSM/2018/12", "wydot/BSM/2019/01", "wydot/BSM/2019/02", "wydot/BSM/2019/03", "wydot/BSM/2019/04", "wydot/TIM/2018/12", "wydot/TIM/2019/01", "wydot/TIM/2019/02", "wydot/TIM/2019/03", "wydot/TIM/2019/04"]
S3 File Count Matching These Prefixes:
8,906
Approximate Data Size Matching These Prefixes:
6.85 GB
Pre-Cleanup Query Findings
Query Number | Query Name | Pseudoquery | Result | Notes |
---|---|---|---|---|
1 | totalRecordCount | SELECT COUNT(*) WHERE AND
| 3,601,067 | 28,730 records found in these files DID NOT satisfy the query constraint. |
2 | timBroadcastRecordCount | SELECT COUNT(*) WHERE | 95,065 | 3,534,732 did not satisfy constraint |
3 | goodOtherRecordCount | SELECT COUNT(*) WHERE metadata.odeReceivedAt >= 2/13/2019 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload AND metadata.recordGeneratedBy != TMC | 0 | 3,629,797 did not satisfy constraint |
4 | badBsmRecordCount | SELECT COUNT(*) WHERE metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload | 3,506,002 | 123,795 did not satisfy constraint |
5 | badOtherRecordCount | SELECT COUNT(*) WHERE metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload | 0 | 3,629,797 did not satisfy constraint |
6 | removableRecordCount | removableRecordCount = badBsmRecordCount + badOtherRecordCount | 3,506,002 + 0 = 3,506,002 | |
7 | totalRecordCount (correctness) | totalRecordCount equal to (==) (timBroadcastRecordCount + removableRecordCount + goodOtherRecordCount) | totalRecordCount = 3,601,067 timBroadcastRecordCount + removableRecordCount + goodOtherRecordCount = 95,065 + 3,506,002 + 0 = 3,601,067 | (Correct) |
8 | earliestGeneratedAt |
| 1970-01-01 00:00:00 | From null data. |
9 | latestGeneratedAt |
| 2019-04-11 15:45:58.766000 | |
10 | invalidRecordCount | invalidRecordCount = removableRecordCount | 3,506,002 | For clarification only. |
11 | invalidS3FileCount |
| 1,488 files | |
12 | invalidS3Files |
| ||
13 | listOfLogFilesBefore |
| 4,754 files | |
14 | validRecordCount | SELECT COUNT(*) WHERE metadata.odeReceivedAt >= 2/13/2019 AND AND metadata.recordGeneratedBy != TMC AND
| 2,012,382 | 1,617,415 records did not satisfy query constraints |
15 | validS3FileCount | SELECT COUNT(s3-filename) WHERE metadata.odeReceivedAt >= 2/13/2019 AND AND metadata.recordGeneratedBy != TMC AND [ | 920 | |
16 | validS3Files | SELECT s3FileName WHERE AND metadata.recordGeneratedBy != TMC [AND metadata.serialId.bundleSize > 1 ] | valid_s3_file_list_fixed.txt | |
17 | listOfLogFilesAfter | SELECT metadata.logFilename WHERE AND metadata.recordGeneratedBy != TMC [ | 3,087 files |
Data Cleanup Actions
The data cleanup process has three main steps: identifying the bad data, deleting it, and confirming success. The above queries have accomplished the first step of identifying the bad data, in particular the list obtained by Query #12 - invalidS3Files. In this section we will discuss the process of safely and correctly deleting the bad data.
To prevent data loss from this process, we will activate S3 versioning for the public data bucket. This versioning feature allows us to easily revert any deletions should the need arise. In addition to activating versioning, the data will first be downloaded and the deletion process will be performed locally first. This "dry run" deletion will allow us to do a test run of the post-cleanup queries to verify that only the right files and records were deleted.
Step 1: Identify scope of bad data
This has been completed by the queries above and in particular the list of files generated by query 12.
Step 2: Delete the bad data (dry run on local data)
For each file listed in the invalid_s3_file_list_fixed.txt
above, the file in the local clone of the S3 bucket will be deleted.
Step 3: Verify cleanup (dry run on local data)
The queries in the Pseudo-queries for Validating the Clean Up section of the ODE Metadata Bug Impact, Resolution, and Prevention document will be run locally and asserted that actual outputs match expected outputs.
If all of these queries pass, then we will move on to the next step of deleting the data in the real S3 bucket.
Step 4: Delete the bad data (real data on S3 Sandbox)
For each file listed in the invalid_s3_file_list_fixed.txt
above, the file in the S3 bucket will be deleted.
Step 5: Verify cleanup (real data on S3 Sandbox)
The queries in the Pseudo-queries for Validating the Clean Up section of the ODE Metadata Bug Impact, Resolution, and Prevention document will be run against the S3 bucket and asserted that actual outputs match expected outputs.
Post-Cleanup Query Findings
Query Number | Query Name | Pseudoquery | Result | Expected Result | Matches Expected? | Notes |
---|---|---|---|---|---|---|
1 | totalRecordCount | SELECT COUNT(*) WHERE AND
| 95,065 | Number of tims in the time period (95,065) | Yes | 36,127 did not satisfy query constraints 131,192 total records in |
2 | timBroadcastRecordCount | SELECT COUNT(*) WHERE | 95,065 | 95,065 (same as before deletion) | Yes | |
3 | goodOtherRecordCount | SELECT COUNT(*) WHERE metadata.odeReceivedAt >= 2/13/2019 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload AND metadata.recordGeneratedBy != TMC | 0 | n/a | Yes | |
4 | badBsmRecordCount | SELECT COUNT(*) WHERE metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload | 0 | 0 | Yes | |
5 | badOtherRecordCount | SELECT COUNT(*) WHERE metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload | 0 | 0 | Yes | |
6 | removableRecordCount | removableRecordCount = badBsmRecordCount + badOtherRecordCount | 0 | 0 | Yes | |
7 | totalRecordCount (correctness) | totalRecordCount equal to (==) (timBroadcastRecordCount + removableRecordCount + goodOtherRecordCount) | totalRecordCount = 95,065 timBroadcastRecordCount + removableRecordCount + goodOtherRecordCount = 95,065 + 0 + 0 = 95,065 | (expected to be correct) | Yes | |
8 | earliestGeneratedAt |
| (None) | No records should be found | Yes | There are no records matching the query constraints, so there are no timestamps to be compared. This matches the expectation that after the deletion is complete, no records should be found. |
9 | latestGeneratedAt |
| (None) | No records should be found | Yes | There are no records matching the query constraints, so there are no timestamps to be compared. This matches the expectation that after the deletion is complete, no records should be found. |
10 | invalidRecordCount | invalidRecordCount = removableRecordCount | 0 | 0 | Yes | |
11 | invalidS3FileCount |
| 0 - empty list | 0 | Yes | |
12 | invalidS3Files |
| 0 - empty list | (empty list) | Yes | |
13 | listOfLogFilesBefore |
| 0 - empty list | 0 (empty list) | Yes | |
14 | validRecordCount | SELECT COUNT(*) WHERE metadata.odeReceivedAt >= 2/13/2019 AND AND metadata.recordGeneratedBy != TMC AND
| 0 | 0 | Yes | |
15 | validS3FileCount | SELECT COUNT(s3-filename) WHERE metadata.odeReceivedAt >= 2/13/2019 AND AND metadata.recordGeneratedBy != TMC AND [ | 0 - empty list | 0 - empty list | Yes | |
16 | validS3Files | SELECT s3FileName WHERE AND metadata.recordGeneratedBy != TMC [AND metadata.serialId.bundleSize > 1 ] | 0 - empty list | 0 - empty list | Yes | |
17 | listOfLogFilesAfter | SELECT metadata.logFilename WHERE AND metadata.recordGeneratedBy != TMC [ | 0 - empty list | 0 - empty list | Yes |