
Summary

WYDOT informed the ODE team that the Driver Alert location data was coming out as zeros. The ODE team investigated the data files provided by the WYDOT team and confirmed the inconsistency. Further investigation into the code base revealed that a bug had been introduced when the metadata.serialId field was enhanced to allow more effective auditing and tracking of data in order to detect missing and duplicate records. The defect was introduced in the logic that calculates the metadata and NOT the payload; there was no impact to the data payload at any time.

The ODE team promptly developed a patch to correct the error in the logic and released it to WYDOT for testing. The Wyoming team pulled the patch from GitHub, deployed it to their development environment, and confirmed the fix. Wyoming then deployed the fix to the production environment and resumed the flow of data through the system.

Unfortunately, three weeks later, WYDOT detected an issue with the released patch and informed the ODE team that the bsmSource metadata field was missing from the ODE output stream (https://github.com/usdot-jpo-ode/jpo-ode/issues/316). The ODE team investigated the issue and confirmed the defect. Investigation into the code base revealed that this bug had been introduced when the previous metadata bug fix was implemented in mid-February. The defect was introduced in the same area of logic that calculates the metadata.

On 3/6/2019, in the ODE Sprint Review and Planning meeting, both issues were discussed and the following action items were recorded and assigned.

Details

This section describes the origin of the issues and their resolution.

Origin

An update to the LogFileToAsn1CodecPublisher class was made to support easier troubleshooting of missing or duplicate records and was included in (PR #263, commit). This class takes the raw payload data and wraps it in metadata to create a distribution-ready message. In this class, the metadata object is created once and then updated for each message that passes through. Reusing the metadata object instead of recreating it is done for performance reasons: it reduces memory usage and improves message processing time. The update changed the logic of how this metadata was populated, and a statement for updating the metadata was missed. As a result, the metadata was not updated between messages and was instead reused with the same information, producing duplicate location and timestamp fields in all metadata generated for records in the same data file.
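The defect pattern, reusing a mutable metadata object without refreshing it for each record, can be illustrated with a minimal sketch (Python for brevity; the actual logic lives in the Java class LogFileToAsn1CodecPublisher, and all names below are hypothetical):

```python
# Minimal sketch of the defect class (hypothetical names; not the actual ODE code).
class Metadata:
    def __init__(self):
        self.latitude = None
        self.longitude = None

def publish_buggy(records):
    """Reuses one metadata object but forgets to refresh it per record."""
    meta = Metadata()  # created once for performance
    out = []
    for rec in records:
        # BUG: the per-record update statement was missed, so every message
        # carries whatever values the shared metadata object happened to hold.
        out.append((rec["payload"], (meta.latitude, meta.longitude)))
    return out

def publish_fixed(records):
    """Same reuse pattern, but the metadata is updated between messages."""
    meta = Metadata()
    out = []
    for rec in records:
        meta.latitude = rec["lat"]    # fix: refresh the shared object
        meta.longitude = rec["lon"]   # for each record
        out.append((rec["payload"], (meta.latitude, meta.longitude)))
    return out
```

In the buggy variant every record in a file comes out with identical (stale) location metadata, which matches the duplicate location and timestamp fields observed in the affected data.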

The new issue was a result of the bug fix implemented for the original metadata defect: https://github.com/usdot-jpo-ode/jpo-ode/pull/313 .

Resolution

The bug fix was merged in pull request #313 on GitHub in accordance with the published ODE Software Quality Policy. It fixed the bug by adding the code to correctly update the metadata between messages. The developer tested the fix locally by running the ODE, uploading data files, and visually inspecting the metadata of the output messages. Once confident from local testing that the fix was effective, the developer pushed the code to GitHub. The ODE developers performed the required peer code review on GitHub, where the code changes were inspected for correctness and completeness. Once the code review verified the fix, the code was merged and promoted through the standard release process, and WYDOT was notified that the fix was ready for User Acceptance Testing (UAT).

A fix for the second bug has also been identified, and an implementation and deployment plan is in progress.

Impact

The time periods during which metadata corruption occurred are specified below.

Data Type (Deposited/odeReceivedAt Time Period):

  • BSM with inconsistent metadata: 12/3/2018 - 2/12/2019
  • Received TIMs with inconsistent metadata (NOT Broadcast TIMs): 12/3/2018 - 2/12/2019
  • Driver Alert with inconsistent metadata: 12/3/2018 - 2/12/2019
  • Distress Notifications with inconsistent metadata: 12/3/2018 - 2/12/2019
  • BSM with missing bsmSource metadata field: 2/13/2019 - dateOfBugFix

Data NOT impacted include:

  • Broadcast TIMs deposited to WYDOT, DataHub and SDC. Note: DataHub only receives broadcast TIMs from WYDOT
  • TIM, Driver Alert and Distress Notifications deposited since 2/13/2019

Resolution and Recovery

1. What data stores require corrective actions to clean data?

The original defective code was deployed on WYDOT's new Production (PROD) server sometime in late November. The PROD server was brought online at the beginning of December (12/3/2018). WYDOT initiated their own clean-up of WYDOT internal data logging on 2/13/2019 and began re-uploading all the data from 12/3/2018 to 2/12/2019. It is envisioned that BSM data will need to be uploaded again to recover from the missing bsmSource issue. To avoid having invalid data in the SDC and DataHub repositories, these repositories will also need to remove the invalid data and await WYDOT's re-upload of good data to replace it. The ODE team is assisting the DataHub team in performing the corrective actions, supported by automated AWS Lambda functions, to ensure ONLY invalid data is removed. In summary:

  • Wyoming: Wyoming has removed all of the ODE-generated data from their database for the 12/3/2018 to 2/12/2019 period and has re-uploaded all of the original log files back through ODE and into the data store. However, due to the bsmSource issue, all BSM data needs to be removed and the log files uploaded again.
  • DataHub: All affected Wyoming data with metadata.odeReceivedAt >= 12/3/2018 until dateOfBugFix AND metadata.recordGeneratedBy != TMC should be removed from the S3 bucket. A notification message will also be posted on DTG and Sandbox to alert users of the inconsistencies and the time frame of the impending correction.
  • SDC: All affected Wyoming data with metadata.odeReceivedAt >= 12/3/2018 until dateOfBugFix AND metadata.recordGeneratedBy != TMC should be removed from the data lake / raw submissions bucket and the Data Warehouse. A notification message will also be posted on the SDC site to alert users of the inconsistencies and the time frame of the impending correction.
  • End Users: A notification message will be posted to inform the users of the inconsistencies and the time frame of the impending correction.


2. What process will be used to identify, isolate, and remove invalid data?

All data uploaded in the bug time periods, with the exception of Broadcast TIMs, will be invalid. To ensure that no valid data is removed from the system, an AWS Lambda function is being created to collect metrics that can be used to verify that invalid data is removed systematically and correctly.

For DataHub, the AWS Lambda function will help identify the count and location of invalid metadata. Analysis of this report, plus any additional checks required by the data store maintainers, will determine the scope of the clean up. The Lambda function can then be used to automatically delete the invalid data from S3.
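As a sketch of the automated deletion step (Python with boto3; the bucket name, key list, and dry_run guard are illustrative assumptions, not the actual Lambda code), note that S3's DeleteObjects API accepts at most 1,000 keys per request, so the identified keys must be batched:

```python
def chunks(keys, size=1000):
    """Yield key batches; S3 DeleteObjects accepts at most 1,000 keys per call."""
    for i in range(0, len(keys), size):
        yield keys[i:i + size]

def delete_invalid_keys(bucket, keys, dry_run=True):
    """Delete the identified invalid S3 objects in batches. With dry_run=True
    the batches are only counted, so the sweep can be reviewed first."""
    deleted = 0
    if not dry_run:
        import boto3  # AWS SDK for Python; needs credentials to actually delete
        s3 = boto3.client("s3")
    for batch in chunks(keys):
        if not dry_run:
            s3.delete_objects(
                Bucket=bucket,
                Delete={"Objects": [{"Key": k} for k in batch]},
            )
        deleted += len(batch)
    return deleted
```

Running with dry_run=True first lets the team confirm that the number of keys slated for deletion matches the invalid-data metrics before anything is removed.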

For SDC, the ODE team will work with the SDC team to remove invalid data using mutually agreed methods to identify the invalid data and generate a report of affected files. Again, analysis of this report, plus any additional checks required by the data store maintainer, will determine the scope of the invalid data and the resulting deletions from the data lake / raw submissions bucket and the Data Warehouse.

The following metrics will aid in assuring the integrity of the data deposited to SDC and DataHub during the bug time period.

Pseudo-queries for Identifying Invalid Data

The pseudo-queries in this section will help data store maintainers find and, where necessary, clean up the data. The primary purpose of these queries is to collect metrics for verifying that the clean-up has removed only the invalid data. The key part of these queries is the WHERE clause. The SELECT clause can be modified both for collecting metrics and for performing the clean-up in an automated fashion. It is up to the data maintainers to use these pseudo-queries in any way that helps them identify and clean up the repositories.

These sections are intended to be run somewhat sequentially as a checklist or procedure for analysis and cleanup. 

Useful variable declaration

totalRecordCount = SELECT COUNT(*) WHERE metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix

This variable should contain the total count of records, both valid and invalid, deposited in the repository during the bug time periods.
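The WHERE clause of this pseudo-query can be expressed as a predicate over each record's metadata; a minimal sketch (Python; the ISO-8601 timestamp format and the dateOfBugFix placeholder value are assumptions):

```python
from datetime import datetime

BUG_START = datetime(2018, 12, 3)        # PROD server brought online
DATE_OF_BUG_FIX = datetime(2019, 4, 12)  # placeholder; substitute the actual fix date

def in_bug_period(meta):
    """WHERE metadata.odeReceivedAt >= 12/3/2018 AND odeReceivedAt < dateOfBugFix."""
    received = datetime.fromisoformat(meta["odeReceivedAt"])
    return BUG_START <= received < DATE_OF_BUG_FIX

def total_record_count(records):
    """Equivalent of the SELECT COUNT(*) pseudo-query over a list of metadata."""
    return sum(1 for meta in records if in_bug_period(meta))
```

The other pseudo-queries below differ only in the additional predicates on metadata.recordGeneratedBy and metadata.payloadType, which can be composed with in_bug_period in the same way.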

WYDOT:
  • Result Before Cleanup: TBD
  • Result After Cleanup: TBD (should be less than Before)
DataHub:
  • Result Before Cleanup: 3,601,067
  • Result After Cleanup: 95,065
SDC:
  • Result Before Cleanup: BSM: 6,458,749; TIM: 416,763
  • Result After Cleanup: TBD (should be less than Before)


timBroadcastRecordCount = SELECT COUNT(*) WHERE metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.recordGeneratedBy == TMC 

This variable should contain the count of only the TIM Broadcast messages deposited during the bug time periods.

WYDOT:
  • Result Before Cleanup: TBD
  • Result After Cleanup: TBD (should be the same as Before)
DataHub:
  • Result Before Cleanup: 95,065
  • Result After Cleanup: 95,065
SDC:
  • Result Before Cleanup: 74,843
  • Result After Cleanup: TBD (should be the same as Before)


goodOtherRecordCount = SELECT COUNT(*) WHERE metadata.odeReceivedAt >= 2/13/2019 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload AND metadata.recordGeneratedBy != TMC

This variable should contain the count of all valid non-BSM messages deposited during the bug time periods. This count should be equal to the count of TIM Broadcast messages for the entire bug period (12/3/2018 - dateOfBugFix) plus received TIM messages deposited on and after 2/13/2019 through dateOfBugFix.

WYDOT:
  • Result Before Cleanup: TBD
  • Result After Cleanup: TBD (may be less than Before due to multiple uploads)
DataHub:
  • Result Before Cleanup: 0
  • Result After Cleanup: 0 (should be ZERO)
SDC:
  • Result Before Cleanup: BSM: 0; TIM: 226,457
  • Result After Cleanup: TBD (may be less than Before due to multiple uploads)


badBsmRecordCount = SELECT COUNT(*) WHERE metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload 

This variable should contain the count of all invalid BSM records deposited during the bug time periods.

WYDOT:
  • Result Before Cleanup: TBD
  • Result After Cleanup: should be ZERO
DataHub:
  • Result Before Cleanup: 3,506,002
  • Result After Cleanup: 0 (should be ZERO)
SDC:
  • Result Before Cleanup: 6,458,749
  • Result After Cleanup: should be ZERO


badOtherRecordCount = SELECT COUNT(*) WHERE metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload 

This variable should contain the count of all invalid non-BSM messages deposited during the bug time periods. This count should be equal to the count of invalid received messages for the period 12/3/2018 - 2/12/2019.


WYDOT:
  • Result Before Cleanup: TBD
  • Result After Cleanup: should be ZERO
DataHub:
  • Result Before Cleanup: 0
  • Result After Cleanup: 0 (should be ZERO)
SDC:
  • Result Before Cleanup: BSM: 0; TIM: 115,463
  • Result After Cleanup: should be ZERO


removableRecordCount = badBsmRecordCount + badOtherRecordCount

This is the total count of invalid records, which can safely be removed from the repository.

WYDOT:
  • Result Before Cleanup: TBD
  • Result After Cleanup: should be ZERO
DataHub:
  • Result Before Cleanup: 3,506,002
  • Result After Cleanup: 0 (should be ZERO)
SDC:
  • Result Before Cleanup: BSM: 6,458,749; TIM: 115,463
  • Result After Cleanup: should be ZERO


totalRecordCount == (timBroadcastRecordCount + removableRecordCount + goodOtherRecordCount)

This check should validate the analysis of the record counts. totalRecordCount should be equal to the sum of timBroadcastRecordCount  + removableRecordCount + goodOtherRecordCount.

WYDOT:
  • Result Before Cleanup: should be TRUE
  • Result After Cleanup: should be TRUE
DataHub:
  • Result Before Cleanup: TRUE
  • Result After Cleanup: TRUE
SDC:
  • Result Before Cleanup: TRUE
  • Result After Cleanup: should be TRUE
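This reconciliation can be sketched as a simple arithmetic check (Python; the example counts below are illustrative, not actual repository results):

```python
def counts_reconcile(total, tim_broadcast, removable, good_other):
    """Sanity check: every record from the bug period must be either a
    broadcast TIM, an invalid (removable) record, or a valid non-BSM record."""
    return total == tim_broadcast + removable + good_other
```

If this check returns False for a repository, some records were double-counted or missed by the category queries, and the scope analysis should be revisited before any deletion is performed.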


earliestGeneratedAt = SELECT MIN(metadata.recordGeneratedAt) WHERE 

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload) OR

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload)

This query returns the generatedAt timestamp of the first record deposited during the bug time periods. This variable will be used during the validation of the cleanup.


WYDOT:
  • Result Before Cleanup: TBD
  • Result After Cleanup: TBD
DataHub:
  • Result Before Cleanup: 1970-01-01T00:00:00Z
  • Result After Cleanup: 2019-04-11T15:40:30.395Z
SDC:
  • Result Before Cleanup: TBD
  • Result After Cleanup: TBD


latestGeneratedAt = SELECT MAX(metadata.recordGeneratedAt) WHERE 

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload) OR

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload)

This query returns the generatedAt timestamp of the last record deposited during the bug time periods. This variable will be used during the validation of the cleanup.


WYDOT:
  • Result Before Cleanup: TBD
  • Result After Cleanup: TBD
DataHub:
  • Result Before Cleanup: 2019-04-11T15:45:58.766Z
  • Result After Cleanup: 2019-04-11 15:40:30.395000
SDC:
  • Result Before Cleanup: TBD
  • Result After Cleanup: TBD


Query the Count of Invalid Data

invalidRecordCount = removableRecordCount

This variable is the same as removableRecordCount and is repeated here for clarity only.

WYDOT:
  • Result Before Cleanup: TBD
  • Result After Cleanup: should be zero
DataHub:
  • Result Before Cleanup: 3,506,002
  • Result After Cleanup: 0 (should be zero)
SDC:
  • Result Before Cleanup: TBD
  • Result After Cleanup: should be zero


invalidS3FileCount = SELECT COUNT(s3-filename) WHERE 

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload) OR

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload)

This variable is the count of invalid S3 bucket files.

WYDOT:
  • Result Before Cleanup: TBD
  • Result After Cleanup: should be zero
DataHub:
  • Result Before Cleanup: 1,488
  • Result After Cleanup: 0 (should be zero)
SDC:
  • Result Before Cleanup: TBD
  • Result After Cleanup: should be zero


Query a List of Invalid Data Files on S3

invalidS3Files = SELECT s3FileName WHERE 

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload) OR

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload)

This query returns a list of all S3 bucket files that are invalid. This list can be used to verify the clean up. This list can also be saved to a file to be used as input to other scripts.

WYDOT:
  • Result Before Cleanup: TBD
  • Result After Cleanup: should return no files
DataHub/SDC:
  • Result Before Cleanup: TBD
  • Result After Cleanup: should return no files


Query a List of source data files resulting in invalid data

listOfLogFilesBefore = SELECT metadata.logFilename WHERE 

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt <= 2/12/2019 AND metadata.recordGeneratedBy != TMC AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload) OR

(metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload)

This query returns a list of all input source files that resulted in invalid S3 files. This list can be used to verify the clean up. This list can also be saved to a file to be used as input to other scripts.

WYDOT:
  • Result Before Cleanup: TBD
  • Result After Cleanup: should return no files
DataHub/SDC:
  • Result Before Cleanup: TBD
  • Result After Cleanup: should return no files



3. What process will be used for verification of corrective actions to address data store (WYDOT, SDC and DataHub) clean-up?

Manual:

  • A sampling of messages will be manually inspected to check that all fields in the metadata are correctly populated with data that makes sense

Automated:

  • At initial implementation, an AWS Lambda function will iterate through all messages from the bug time period to perform several context checks (future checks will be added as the team and approved Product Owners determine them valuable), including but not limited to the following:
    • Timestamps are not duplicated
    • Location data are not duplicated 
    • Serial numbers are sequential and not duplicated
    • Key required fields are not missing
  • The AWS Lambda function is being developed and tested. In its prototype state, it has been used to identify potentially invalid data in the month of December. This function will be used to analyze all of the data in the affected time period and identify invalid data.
  • Removal of the data will not be performed until a later date when the report of corrupted data files is determined to be complete and correct
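A minimal sketch of such context checks (Python; the message shape and field names below are simplified assumptions, and the actual Lambda implementation may differ):

```python
def context_check(messages):
    """Flag duplicate timestamps/locations, non-sequential serial numbers,
    and missing required fields within one batch of message metadata.
    The field names used here are simplified assumptions."""
    problems = []
    seen_ts, seen_loc = set(), set()
    prev_serial = None
    for i, meta in enumerate(messages):
        ts = meta.get("recordGeneratedAt")
        loc = (meta.get("latitude"), meta.get("longitude"))
        serial = meta.get("serialId", {}).get("recordId")
        if ts in seen_ts:
            problems.append((i, "duplicate timestamp"))
        if loc in seen_loc:
            problems.append((i, "duplicate location"))
        if prev_serial is not None and serial != prev_serial + 1:
            problems.append((i, "non-sequential serialId"))
        if ts is None or None in loc:
            problems.append((i, "missing required field"))
        seen_ts.add(ts)
        seen_loc.add(loc)
        prev_serial = serial
    return problems
```

A clean batch yields an empty list; any returned tuples identify the offending message index and the check it failed, which can then be aggregated into the corruption report.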

Pseudo-queries for Validating the Clean Up (after the completion of the clean-up)

Query the Count of Valid Data

validRecordCount = SELECT COUNT(*) WHERE metadata.odeReceivedAt >= 2/13/2019 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.recordGeneratedBy != TMC AND metadata.recordGeneratedAt >= earliestGeneratedAt AND metadata.recordGeneratedAt <= latestGeneratedAt [AND metadata.serialId.bundleSize > 1]

This variable is the count of valid records in the bug time periods.


WYDOT:
  • Result Before Cleanup: TBD (should be greater than or equal to invalidRecordCount or removableRecordCount)
  • Result After Cleanup: TBD (should be the same as Before Cleanup)
DataHub:
  • Result Before Cleanup: 2,012,382 (should be greater than or equal to invalidRecordCount or removableRecordCount)
  • Result After Cleanup: 0
SDC:
  • Result Before Cleanup: TBD (should be greater than or equal to invalidRecordCount or removableRecordCount)
  • Result After Cleanup: TBD (should be the same as Before Cleanup)


validS3FileCount = SELECT COUNT(s3-filename) WHERE metadata.odeReceivedAt >= 2/13/2019 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.recordGeneratedBy != TMC AND metadata.recordGeneratedAt >= earliestGeneratedAt AND metadata.recordGeneratedAt <= latestGeneratedAt [AND metadata.serialId.bundleSize > 1]

This variable is the count of valid S3 files deposited in the bug time periods.


WYDOT:
  • Result Before Cleanup: TBD (should be greater than or equal to invalidS3FileCount)
  • Result After Cleanup: TBD (should be similar to Before Cleanup)
DataHub:
  • Result Before Cleanup: 920 (should be greater than or equal to invalidS3FileCount)
  • Result After Cleanup: 0 (should be similar to Before Cleanup)
SDC:
  • Result Before Cleanup: TBD (should be greater than or equal to invalidS3FileCount)
  • Result After Cleanup: TBD (should be similar to Before Cleanup)


Note: Because the bug was introduced with the serialId changes, most if not all invalid metadata records will have a bundleSize greater than one. Comparing the counts with and without this filter parameter will aid in identifying invalid data.

[brackets] indicate an optional parameter


Query a List of Valid Data Files on S3

validS3Files = SELECT s3FileName WHERE metadata.odeReceivedAt >= 2/13/2019 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.recordGeneratedBy != TMC [AND metadata.serialId.bundleSize > 1]

This query returns a list of S3 files from the bug time periods.

WYDOT:
  • Result Before Cleanup: TBD (should return a similar list as invalidS3Files)
  • Result After Cleanup: TBD (should be the same as Before Cleanup)
DataHub:
  • Result Before Cleanup: valid_s3_file_list_fixed.txt (should return a similar list as invalidS3Files)
  • Result After Cleanup: empty list
SDC:
  • Result Before Cleanup: TBD (should return a similar list as invalidS3Files)
  • Result After Cleanup: TBD (should be the same as Before Cleanup)


[brackets] indicate an optional parameter


Query a List of source data files

listOfLogFilesAfter = SELECT metadata.logFilename WHERE metadata.odeReceivedAt >= 2/13/2019 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.recordGeneratedBy != TMC [AND metadata.serialId.bundleSize > 1]

This query returns the list of source log files uploaded to ODE during the bug time period; it should return the same list before and after the clean up.

WYDOT:
  • Result Before Cleanup: TBD (should return the same list as listOfLogFilesBefore)
  • Result After Cleanup: TBD (should be the same as Before Cleanup)
DataHub:
  • Result Before Cleanup: valid_log_file_list.txt (should return the same list as listOfLogFilesBefore)
  • Result After Cleanup: empty list
SDC:
  • Result Before Cleanup: TBD (should return the same list as listOfLogFilesBefore)
  • Result After Cleanup: TBD (should be the same as Before Cleanup)


Metadata Bug Analysis and Recovery Reports


Validation Checklist


WYDOT:
  • validRecordCount MUST BE greater than or equal to the original invalidRecordCount
  • Re-running the query for invalidRecordCount should yield ZERO
  • validS3FileCount MUST BE greater than or equal to invalidS3FileCount
  • Re-running the query for invalidS3FileCount should yield ZERO
  • Re-running the query for List of Invalid Data Files on S3 should return no results
  • Re-running the query for List of source data files resulting in invalid data should return no results
DataHub:
  • validRecordCount MUST BE greater than or equal to invalidRecordCount or removableRecordCount
  • Re-running the query for invalidRecordCount should yield ZERO
  • validS3FileCount MUST BE greater than or equal to invalidS3FileCount
  • Re-running the query for invalidS3FileCount should yield ZERO
  • Re-running the query for List of Invalid Data Files on S3 should return no results
  • Re-running the query for List of source data files resulting in invalid data should return no results
SDC:
  • validRecordCount MUST BE greater than or equal to the original invalidRecordCount
  • Re-running the query for invalidRecordCount should yield ZERO
  • validS3FileCount MUST BE greater than or equal to invalidS3FileCount
  • Re-running the query for invalidS3FileCount should yield ZERO
  • Re-running the query for List of Invalid Data Files on S3 should return no results
  • Re-running the query for List of source data files resulting in invalid data should return no results


4. What are the target timelines for additional Quality Assurance, the bug fix, UAT testing of the bug fix, and identifying, correcting, and validating the data?


Each task below lists the Description, Owner, Target Completion Date, and Actual Completion Date.
Task 1: QA Validation Checklist and Scripts
ODE will incorporate a QA checklist and script into their QA process.
Owner: BAH. Target: 3/19/2019 (End of ODE Sprint 43). Actual: 3/19/19.
Task 2: bsmSource bug fix
ODE will fix the missing bsmSource defect and run the QA Validation Checklist and Scripts to verify the fix and regression test. This fix will be on top of the new SDW feature implementation.
Owner: BAH. Target: 3/19/2019. Actual: 3/19/19.
Task 3: DataHub Contextual Validity Checker (Canary)
ODE will deploy an AWS Lambda function to check for data inconsistencies as they arrive on DataHub. Such inconsistencies may include serialId serial numbers grossly out of order or repeating, timestamps repeating, required fields missing or null, etc. These Lambda functions will be shared with the community on GitHub so the SDC and WYDOT teams will also be able to use them for data validation. (ODE-1230)
Owner: BAH. Target: 4/3/2019. Actual: 4/4/19.
Task 4: ODE software release v1.0.5
ODE will release the validated ODE software to start UAT testing by WYDOT.
Owner: BAH. Target: 3/19/2019. Actual: 3/19/19.
Task 5: WYDOT UAT
WYDOT will deploy jpo-ode-1.0.5 on their DEV server and perform UAT testing. The checklist and validation scripts will be available to them for test and validation. The SDC team will test in their test environment.
Owner: WYDOT. Target: 3/26/2019. Actual: 3/25/19.
Task 6: DataHub Sandbox folder re-structuring decision
This item is not related to the metadata bug, but for the sake of coordination this task was inserted to assess whether it would be beneficial and more efficient to perform folder restructuring at the same time as data clean-up and before data re-upload. It was ultimately decided and approved by the PO that folder restructuring adds unnecessary complexity to the clean-up process and schedule and is best deferred until after the data has been completely restored. See ODE-1229 and RDA-674 for details.

Excerpt from approval email:

"Finally, based on most recent discussions with you for clarity, we have determined that the folder structure changes that have been discussed will not impact the timing of the data cleanup process we have documented and planned with WYDOT, DataHub, and SDC. But we will again coordinate with “Scrum of Scrums” to analyze, plan and execute any folder structure work after cleanup activities are completed. After any folder structure change implementation work is completed, communications to data store end users will then be consolidated to include BOTH cleanup close out communications and any additional required notification for folder structure changes."

Owner: BAH. Target: 3/27/2019.
Task 7: Commencement of cleanup coordination activities
All teams (ODE, Wyoming, DataHub and SDC) collaborate on an approach to remove invalid data from the repositories. Meeting with WYDOT to establish the UAT completion timeline. (ODE-1226, RDA-667)
Owner: ALL. Target: week of 3/20/2019. Actual: 3/25/19.
Task 8: GO/NO-GO FOR WYDOT PROD DEPLOYMENT
ODE, DataHub and SDC teams will verify that freshly arrived BSM, TIM and Driver Alert messages are correct and consistent. The validation checkers deployed to both systems should also verify data validity in DEV.
Coordination meeting to confirm all teams are receiving clean data from WYDOT: GO/NO-GO for WYDOT PROD deployment.
  • Verify DataHub using the Lambda "Canary" function running in BOTH Test and Prod (reference task: RDA-668)
  • Verify SDC using the ODE Validator in BOTH Test and Prod (reference task: DOTAS-720)
Owner: WYDOT, DataHub, SDC. Target: 4/5/2019. Actual: 4/11/2019.
Task 9: WYDOT PROD deployment
UAT testing completed; Wyoming will promote v1.0.7 to the PROD environment AFTER the coordination meeting with stakeholders. (ref: ODE-1211 → ODE-1200, ODE-1202)
Owner: WYDOT. Target: 4/5/2019. Actual: 4/11/2019.
Task 10: Query and populate Useful variable declarations
ODE collects counts as specified in the Useful variable declarations section. This exercise will identify the earliest and latest generatedAt times for bad data. These counts can be used in cleanup verification. (ODE-1200, ODE-1202)
Owner: WYDOT, DataHub, SDC. Target: 4/17/2019.
Task 11: Identify the quantity of invalid records and the quantity of S3 files containing invalid records
Run the queries mentioned in the Query the Count of Invalid Data section above.
Note: The BSM and TIM data types may require separate cleanup processes, as affected BSMs can likely be identified and removed without additional analysis. TIM messages will likely need additional effort to isolate affected TIMs due to the inability to re-upload unaffected Broadcast TIMs. Since the DataHub repository only contains Broadcast TIMs, which were not affected by the bug, no action is required for DataHub TIM cleanup. Only invalid BSMs received from 12/3/2018 - dateOfBugFix will need to be removed. SDC would need to remove all received TIMs for the received period of 12/3/2018 - 3/12/2019. WYDOT has already removed all invalid TIMs and the original invalid BSMs. Only BSMs received from 2/13/2019 - dateOfBugFix must be removed and uploaded again. WYDOT, SDC and DataHub will have all BSM records refreshed after the bug fix is deployed to the WYDOT PROD server. WYDOT should review whether their deduplication software will remove or retain the erroneous data. (ODE-1200, ODE-1202)
Owner: WYDOT, DataHub, SDC. Target: 4/19/2019.
Task 12: Identify the names of the S3 files containing invalid records
Run the query mentioned in the Query a List of Invalid Data Files on S3 section above. (ODE-1200, ODE-1202)
Owner: WYDOT, DataHub, SDC. Target: 4/19/2019.
Task 13: Identify the log file (original data file) names containing invalid records
Run the query mentioned in the Query a List of source data files resulting in invalid data section above. (ODE-1200, ODE-1202)
Owner: WYDOT, DataHub, SDC. Target: 4/19/2019.
Task 14: Create report summarizing findings of invalid record generatedAt times, counts, and S3 and log file names
The results of the queries from tasks 9-12 should be aggregated into a summary report to be presented to the product owner. (ODE-1200, ODE-1202)
Owner: WYDOT, DataHub, SDC. Target: 4/19/2019.
Task 15: GO/NO-GO ON DATA REMOVAL
Coordination meeting to confirm all teams' data removal plans with all stakeholders. (ODE-1212)
  • Verify DataHub (reference task: RDA-669)
  • Verify SDC (reference task: DOTAS-721)
Owner: WYDOT, DataHub (GO), SDC. Target: 4/26/2019. Actual: recovered schedule and conducted the meeting on 4/19/2019.
Task 16: Perform cleanup
Once confidence in the summary findings is gained, the Lambda function used to run the queries in tasks 9-12 should be modified to delete the S3 files found by those queries. Running this function will be the cleanup step. (ODE-1212 → ODE-1203, ODE-1204; RDA-670)
Owner: WYDOT, DataHub, SDC. Target: 4/19/2019.
Task 17: Rerun queries to ensure that cleanup was successful
The queries mentioned in the Pseudo-queries for Validating the Clean Up section above:
  • Query the Count of Valid Data
  • Query a List of Valid Data Files on S3
  • Query a List of source data files
should be run, and the expected output should be verified against the actual output. (ODE-1203, ODE-1204)
Owner: WYDOT, DataHub, SDC.
Task 18: Validate that the cleanup performed the corrective actions
The Validation Checklist above should be iterated to ensure that the cleanup actions deleted the invalid data and did not affect the valid data. In addition to the automatic validation steps using the queries, manual inspection of the bucket should be performed as a sanity check. (ODE-1203, ODE-1204)
Owner: WYDOT, DataHub, SDC.
19
  • GO/NO-GO ON WYDOT RE-UPLOAD

COORDINATION MEETING TO CONFIRM GO/NO-GO ON WYDOT RE-UPLOAD (ODE-1269)

DataHub and SDC expect the following to be re-uploaded by WYDOT:

  1. ALL BSM (bsmTx, bsmLogDuringEvent, rxMsg), TIM (rxMsg, dnMsg), and Driver Alert (driverAlert) files received from 12/3/2018 - 4/11/2019, inclusive, need to be re-uploaded.


  • Verify DataHub (reference task: RDA-751)
  • Verify SDC (reference task: DOTAS-722)

4/26/2019
20
  • WYDOT Starts Re-upload of historical data

WYDOT team starts re-uploading of all data files that were identified during the analysis phase as invalid and deposits them to respective data stores.


  • WYDOT sends e-mail communication when re-upload begins and communicates if there is any risk to meeting Step 20 below by 5/3/2019

Due to inconsistencies between data stored in the WYDOT database and the SDC, an investigation was initiated with the following results:

  • It was concluded that the Lear device is writing duplicate records into the data files that are being uploaded to the ODE
  • Lear will be fixing the issue in the next release, but we are not sure when that release will come and be deployed. Therefore, we must make do with the current release and post-process the data to remove duplicates.
  • Both WYDOT and SDC have de-duplication in place before data is deposited to the Oracle database and SDC Data Warehouse. However, the de-duplication is potentially flawed because it does not take the source of the BSM (“metadata.bsmSource”) into account. As is, the de-duplication could remove BSMs that were uploaded by different vehicles (EV or RV).
  • We do not know, and cannot easily determine, whether all the data has been deposited to SDC and WYDOT due to the excessive load during the initial re-upload, the Lear duplicate record issue, and the de-duplication deficiencies.
  • Currently the deduplication is based on the following field values:
    • MSGCNT
    • SECMARK
    • POSITION_LAT
    • POSITION_LONG
    • POSITION_ELEV
    • RECORD_GENERATED_AT
    • LOG_FILE_NAME
  • BSM_SOURCE must be added to the list to perform a more correct de-duplication of BSM data
  • Brandon will be out of the office 5/20-5/22 and will be available to start re-upload on 5/23
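The corrected de-duplication key described above can be sketched as follows. The field list follows the bullets above; the flat-record structure and function names are assumptions for illustration, not the actual WYDOT or SDC code:

```python
# Field list from the bullets above; BSM_SOURCE is the newly added field,
# so BSMs uploaded by different vehicles (EV vs. RV) are no longer collapsed.
DEDUP_FIELDS = (
    "MSGCNT", "SECMARK", "POSITION_LAT", "POSITION_LONG",
    "POSITION_ELEV", "RECORD_GENERATED_AT", "LOG_FILE_NAME",
    "BSM_SOURCE",
)


def dedup_key(record):
    """Build the composite de-duplication key for one flat BSM record."""
    return tuple(record[f] for f in DEDUP_FIELDS)


def dedupe(records):
    """Keep the first record seen for each de-duplication key."""
    seen, unique = set(), []
    for rec in records:
        key = dedup_key(rec)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```

With BSM_SOURCE in the key, two records that agree on every other field but came from different sources are correctly kept as distinct messages.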

WYDOT

4/30/2019
21
  • De-dup correction
  • Brandon has added the additional BSM_SOURCE field in the de-duplication algorithm which is under test on DEV. Brandon will deploy the new de-dup to PROD by 5/15/2019 00:00:00 UTC. If unable to complete these tasks by 5/15/2019 00:00:00 UTC, we could simply record the UTC time when the new de-dup process is in place and will use that time stamp for the cleanup (WYDOT de-dup timestamp).
  • SDC will implement the fix to de-dup process. SDC will record when the de-dup process is deployed as the “SDC de-dup timestamp” for removal of data.
  • DataHub will propose to Ariel to implement de-dup process as part of the Folder Restructuring effort. (Michael Middleton (Unlicensed)Lien, Julia [USA] (Unlicensed))
  • WYDOT de-dup timestamp = 2019-5-15T00:00:00.000Z
  • SDC de-dup timestamp = ???
5/14/2019
22
  • Partial/Duplicate Data Removal
  • Brandon will remove all BSM, Received TIM and Driver Alert data from 12/3/2018 00:00:00 UTC through 5/15/2019 00:00:00 UTC or whatever time the new de-dup is deployed from Oracle database. TIM Broadcast data shall not be removed. (@Brandon Payne)
  • SDC will remove all BSM, Received TIM and Driver Alert data from 12/3/2018 00:00:00 UTC through the “WYDOT de-dup timestamp” from the Data Warehouse and correct any missing data in the data warehouse after the fact without impacting other teams. (@Chupp, William (Volpe), @Mayorskiy, Vyacheslav CTR (Volpe)). 
  • DataHub will also remove all data from 12/3/2018 00:00:00 UTC through 5/15/2019 00:00:00 UTC from the Sandbox. (Matthew Schwartz (Unlicensed), Lien, Julia [USA] (Unlicensed))
  • De-dup correction and Data Removal must be completed by close of 5/22/2019 so when Brandon returns and starts re-uploading data all repositories are ready.
  • WYDOT 5/14/2019
  • SDC
  • DataHub


5/22/2019
23
  • Start Re-upload Round 2

On 5/23/2019, Brandon will re-upload all data from 12/3/2018 00:00:00 UTC through 5/15/2019 00:00:00 UTC.

WYDOT 5/23/2019
24
  • DataHub Release Note

DataHub will post a Release Note indicating that there are and will be duplicate records in the data until further notice. When that will be depends on Lear fixing the firmware. (Lien, Julia [USA] (Unlicensed), Michael Middleton (Unlicensed))

BAH 5/23/2019
25
  • Update this page

Hamid Musavi (Unlicensed) will update the metadata bug Confluence page with current status, understanding and action plan. Meanwhile, copying @Ariel.Gold@dot.gov so she is informed of where we are in the re-upload process.

BAH 5/17/2019
26
  • VERIFICATION OF SUCCESSFUL RESTORATION OF DATA

COORDINATION CLOSE OUT E-MAIL REPORTS TO CONFIRM VERIFICATION OF CLEAN UP COMPLETED


  • After WYDOT upload, confirm that the data in the WYDOT BSM data set on DTG is all clean. (RDA-760)


  • WYDOT
  • DataHub
  • SDC
6/17/2019

Meeting held 5/3/2019, but the decision was "NO-GO"; data re-upload issue resolution is ongoing with daily meet-ups:

  • As of 5/14/19 data duplication bug uncovered in Lear source data to WYDOT
27
  • Communicate to data users

Communicate to all data users of WYDOT and SDC, and document on DataHub, that the cleanup is complete

  • Verify DataHub (reference task: RDA-671)
BAH 5/31/2019


Prevention


1 - What preventative measures will be put in place to better detect future potential “bad” data defect events and notify ODE stakeholders of potentially invalid data?

Immediate actions:

  • The ODE unit test suite will be enhanced to detect defects of this nature at development time (ODE-1158)
  • An AWS Lambda function will be deployed on DataHub, and shared with the SDC team to be deployed there, to capture violations of certain key contextual test rules and automatically alert the Operations teams and the Data Provider. Examples include out-of-sequence or duplicate serialId or timestamp values, and zero or null values in required fields such as lat/long/speed/heading if persistent over a period of time. (ODE-1199)
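A hedged sketch of the kind of contextual rule check such a Lambda might run over a batch of records (the record structure and field paths are simplified assumptions; the real ODE metadata is more deeply nested):

```python
def check_records(records):
    """Flag contextual rule violations in a batch of decoded ODE records.

    Checks two of the rules named above: duplicate serialId values, and
    zero lat/long positions. Returns a list of (rule, serialId) alerts
    that the Lambda would fold into its emailed report.
    """
    alerts = []
    seen = set()
    for rec in records:
        sid = rec["metadata"]["serialId"]
        if sid in seen:
            alerts.append(("duplicate serialId", sid))
        seen.add(sid)
        lat = rec["payload"]["latitude"]
        lon = rec["payload"]["longitude"]
        if lat == 0 and lon == 0:
            alerts.append(("zero position", sid))
    return alerts
```

In the deployed version, a non-empty alert list would trigger the report generation and email notification described under question 2 below, rather than simply being returned.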

Long-term actions pending Product Owner approval:

  • Automated integration testing will be introduced to the ODE Software Quality Policy to provide end-to-end test coverage through a test harness implementation, significantly reducing reliance on visual inspection during local development and prior to Prod deployments
  • A Data Validation feature will be added to the ODE to flag any data elements that fall outside a preconfigured validity range
  • A contextual data quality checker module will be added to the ODE backlog and presented to the Product Owner for approval and prioritization

Note: Currently, payload data (ODE inputs) is validated against the J2735 standard with essentially 100% unit test coverage; the only current gap in payload testing is validation of incoming JSON during encoding of TIMs. Risk in this area is low, given that the encoder module will fail to encode invalid TIM input potentially generated by a bug or error in the incoming JSON.

2 - What process will be used to investigate potential “bad” data detection alerts to initiate a Data Incident Response?

The proposed AWS Lambda corrupt-data "canary" will run against both test and Prod DataHub S3 storage buckets, executing every day to check data for contextual validity. Upon a failure of this check, the Lambda function will automatically generate a report of the affected data and the checks that failed, and will email that report to the relevant parties.

Upon receipt of this email alert, the appropriate parties will begin using both manual and automated inspection tools to inspect the alerted data and confirm the corruption. The investigators will work to isolate the affected data and the cause of the corruption, and then begin the remediation process. Part of this process will include creating a post-mortem report.


3 - What is the Data Incident Response protocol for data set maintainers, including initial notification and communications?

a. Impact analysis / triage?

Data incident responders will begin identifying the scope of the corruption and the cause.


b. Defect resolution?

Data incident responders will alert the data store maintainers to coordinate suspension of data depositing and inform all potentially impacted systems and projects. Corrupt data MAY then be removed. When a bug fix is ready, the fix will be deployed and the invalid data shall be re-uploaded.


c. Data Set impact analysis?

Responders will use both manual inspection and automated tools to identify the full impact of the bug.


d. Data Set corrective action planning?

Once the cause is identified, a post-mortem will be created. The post-mortem document will be used to discuss how the data could be caught, mitigated, and prevented in the future.


e. Analysis of additional preventative measures?

Upon identification of the cause, preventative measures will be contemplated and enacted.

Action & Decision Items from Sprint Review/ Planning 3/6/19

Item | Title | Description | Owner | Target Date
1. Action: Detail separate approaches for TIMs cleanup and BSM cleanup to have the most expedient and low-risk approach for each data type

Since Broadcast TIMs were not affected by the Metadata bugs and are impossible for WYDOT to re-upload, TIMs cleanup will require additional checks to remove only the affected Received TIMs in impacted data stores, while BSMs can likely be removed and re-uploaded in the impacted data stores in a simpler process with fewer steps.

Update the "2. What process will be used to identify, isolate, and remove invalid data?" section above to detail the approach for TIMs and BSM data types separately with concurrence from WYDOT, DataHub, and SDC to ensure most expedient and low-risk approaches and processes are optimized for clean-up of each data type.

Cleanup processes should be result of collaboration between WYDOT, ODE, DataHub, and SDC and employ one set of shared validation tools for data store cleanup.

Updated document above needs to be sent to Ariel before 4:30 pm eastern time on 3/8/19.

  • BAH
3/8/19
2. Action: Consolidate full analysis, impact, recovery and prevention for both Metadata bug events in one document (above) to support a single cleanup process. Since both Metadata bug events are related, the document above will be updated with additional content related to the bsmSource bug.
  • BAH
3/7/19
3. Decision: Do not start the data store cleanup until the validated fix for the second bug (bsmSource issue) is deployed and complete.

To lower risk of any additional Metadata bug events and avoid having to complete two separate cleanup cycles on the impacted data stores, ODE will implement the approved Quality Assurance measures detailed in ODE-1206, ODE-1207, and ODE-1208 before implementing the fix for the bsmSource bug in ODE-1209.

After ODE-1209 is completed, then the data store cleanup process will be completed.

  • BAH
3/6/19
4. Action: Verify that Socrata may not require ANY cleanup actions based on how many records it holds and may "clean itself" in a reasonable timeframe. Due to the number of records that Socrata holds (3 million) and a data rate from WYDOT of approximately 2 million records per day, it is likely that old data will be removed automatically and no records impacted by Metadata bugs will be in the system after approximately 36 hours. Need to verify that no cleanup actions are required for Socrata. 3/8/19
5. Action: Immediate notification with timeline communicated to end users. Ensure that end users of the impacted data stores are notified with a reasonable timeline for when data cleanup will be completed and the impact to them. This notification needs to be sent before 4:30 pm eastern time on 3/8/19. See notification statement below.
3/8/19
6

Action: Validation feature to be added to data stores' (SDC, WYDOT, DataHub) backlogs

Coordinate between teams (perhaps through a scrum of scrums) and get the ingest validation feature added to their respective backlogs, utilizing the output validation scripts that the ODE team will develop. The scripts should be reusable for each team’s data ingest validation process (e.g., for SDC, WYDOT, Data Hub) to ensure that they validate all data coming in. 3/8/19


Notification

Data Defect Notification (3/8/2019)

Statement for end users of data stores - To be posted to relevant end user area for WYDOT, DataHub, DTG, and SDC no later than 4:30 pm eastern time on 3/8/19:


We would like to bring to your attention some errors in the Wyoming metadata fields described below:

Data TypeInconsistency ObservedDeposit Time Frame
  • Basic Safety Messages (BSM)
  • Traveler Information Messages (TIM)
  • Driver Alert
  • Distress Notifications

Erroneous metadata observed in the following fields:

  • metadata.bsmSource
  • metadata.securityResultCode
  • metadata.receivedMessageDetails
  • metadata.odeReceivedAt
  • metadata.recordGeneratedAt
12/3/2018 - 2/12/2019
  • Basic Safety Messages (BSM)

Missing metadata observed in the following field:

  • metadata.bsmSource
2/13/2019 - Present

All payload data included in BSM and TIM messages, as well as the entire metadata for TIM Broadcast messages (those with metadata.recordGeneratedBy = TMC), are unaffected and entirely correct. We expect that these metadata errors will be fully corrected in a few weeks and all invalid metadata will be removed shortly after.

We appreciate your patience as we work through these metadata errors. If you have any questions about these issues, please feel free to contact the Dataset Owner through the normal support channels.



For DataHub, we recommend the following short statement be added to the Socrata description, which will appear here on DTG:

“We would like to bring to your attention some errors in the metadata fields for the Wyoming dataset deposited from 12/3/2018 – present, which we are working to fix. Please see full details in the ITS DataHub data sandbox.”

The ITS DataHub data sandbox will contain the full message as shown above, with portions referring to TIM removed as DataHub TIM data is unaffected.

For SDC and Wyoming, an email will be sent to data users to inform them of the metadata errors.


Data Cleanup Complete Notification (4/25/2019)

Below is a proposed comm tailored specifically for DataHub. The comm going out to SDC and WYDOT will be similar. The highlighted wording signifies customization for each specific user group, i.e. SDC and WYDOT. Please review the language and reply to all with your feedback.


This is an update regarding errors in the Wyoming metadata that were communicated to the user community on 3/8/2019. The bug manifested itself in incorrect metadata fields in the Wyoming Basic Safety Message (BSM) and Traveler Information Message (TIM) data. All payload data included in BSM and TIM are unaffected and entirely correct.

The ODE software version that contributed to the Metadata field errors has been corrected and deployed to Wyoming production server on 4/11/2019. No more invalid data is being deposited to DataHub as of 4/12/2019. All existing invalid data have also been removed. The historical data will be restored in the upcoming days and weeks. A final communication will be made once all data restoration has been completed.

We appreciate your patience as we work through resolving these metadata errors. If you have any questions about these issues, please feel free to contact RDAE_Support@bah.com.

Final Notification (7/2019)


SDC/WYDOT notification

Below is a proposed comm tailored for SDC and WYDOT, to be sent out via mailing list similar to prior communications. The underlined wording signifies customization for each specific user group, i.e. SDC and WYDOT. Please review the language and reply to all with your feedback.


This is an update regarding errors in the Wyoming metadata that were communicated to the user community on 3/8/2019 and 4/26/2019.

As noted previously, the bug manifested itself in incorrect metadata fields in the Wyoming Basic Safety Message (BSM) and Traveler Information Message (TIM) data. All payload data included in BSM and TIM are unaffected and entirely correct. The ODE software version that contributed to the Metadata field errors had been corrected and deployed to Wyoming production server on 4/11/2019. No more invalid data is being deposited to DataHub as of 4/12/2019 and all existing invalid data have been removed as of 4/26/2019.

Restoration of the historical data is complete as of 6/28/2019. To stay abreast of future issues regarding Wyoming Connected Vehicle Pilot data, we suggest that you monitor the open issues of the JPO Operational Data Environment repository at https://github.com/usdot-jpo-ode/jpo-ode. We also welcome you to report any issues that you see in the GitHub Issues of the repository.

We appreciate your patience as we work through resolving these metadata errors. If you have any questions about these issues or have additional issues to report, please contact <insert appropriate email address here>.


DataHub

Below is a proposed comm tailored to be posted on ITS Sandbox http://usdot-its-cvpilot-public-data.s3.amazonaws.com/index.html. It contains fewer contextual details as it'll appear with the past notifications. Please review the language and reply to all with your feedback.


This is an update regarding errors in the Wyoming metadata that were communicated to the user community on 3/8/2019 and 4/26/2019.

Restoration of the historical data is complete as of 6/28/2019. Furthermore, we will be reorganizing the Wyoming data folders to be based on the generatedAt timestamp instead of the receivedAt timestamp in the coming weeks to ensure easier retrieval of relevant data. To stay abreast of future issues regarding Wyoming Connected Vehicle Pilot data, we suggest that you monitor the open issues of the JPO Operational Data Environment repository at https://github.com/usdot-jpo-ode/jpo-ode.

We appreciate your patience as we work through resolving these metadata errors. If you have any questions about these issues, concerns about the reorganization of ITS Sandbox data, or additional data issues to report, please contact RDAE_Support@bah.com.



