Meiyen Chen, PhD, is a senior data scientist at Vpon. Together with her team and engineers, she works with more than 1 billion rows of data every day. In addition to a large amount of raw data stored in JSON format in Hadoop, her team looks at data stored in Google Analytics, Microsoft SQL Server, and more.
“We found that it required a long time whenever we needed to produce a quarterly report that required us to query three months’ worth of data directly from our Hadoop system,” Dr. Chen explained.
“The team actually aggregated and extracted part of the data, those that users commonly access for key metrics, in a Teradata data warehouse to quicken the process,” she added.
The team also faced challenges when presenting the reports to end users. She said, “We would place the analytical findings in Excel spreadsheets and send them across to end users. The problem was that end users would not be able to extract information from external datasets for comparison when they are accessing information in the spreadsheets. They would often come back to my team to clarify certain parts or make additional requests.”
Furthermore, some end users needed access to updated data insights while outside the office, and they found it frustrating to review Excel spreadsheets on a phone or tablet.
This process of extracting and analyzing the data, generating the reports, and working with end users to resolve issues could take up to six weeks, according to Dr. Chen.
Each week, the team would receive more than 10 tickets requesting ad hoc data queries.