• Distribution-aware Data Collection: Data scientists often assemble data sets for analysis from the sources available to them. A major challenge is ensuring that the resulting data set follows the distribution required for model training and analysis. Whether data is collected through an experiment or obtained from a data provider, the data from any single source may not meet the desired distribution requirements, so a union of data from multiple sources is often required. In this project, we study how to acquire such data in the most cost-effective manner. [papers] [code and data]

• TSUBASA: Climate Network Analysis: A climate network represents the global climate system as the interactions among a set of anomaly time series. Network science has been applied to climate data to study the dynamics of such networks. The core task in enabling network dynamics analysis on climate data is the efficient computation and update of the correlation matrix for user-defined time windows over historical and real-time data. In this project, we focus on scalable and efficient computation of all-pair time-series correlation, as well as search algorithms for (un)correlated time series. [papers] [code and data] [tsupy library]
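
To make the windowed-correlation task above concrete, here is a minimal sketch of one common way to answer correlation queries over user-defined time windows without rescanning the raw series: keep a few sums per fixed-width "basic window" and combine them at query time. This illustrates the general idea only. It is not taken from the TSUBASA papers or the tsupy library, and the names `basic_window_stats` and `window_correlation` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Per-window sufficient statistics: (n, sum_x, sum_y, sum_xx, sum_yy, sum_xy).
Stats = Tuple[int, float, float, float, float, float]


def basic_window_stats(x: Sequence[float], y: Sequence[float], width: int) -> List[Stats]:
    """Summarize two aligned series as one statistics tuple per basic window."""
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    if width <= 0:
        raise ValueError("width must be positive")
    stats = []
    for start in range(0, len(x), width):
        xs = x[start:start + width]
        ys = y[start:start + width]
        stats.append((
            len(xs),
            sum(xs),
            sum(ys),
            sum(a * a for a in xs),
            sum(b * b for b in ys),
            sum(a * b for a, b in zip(xs, ys)),
        ))
    return stats


def window_correlation(stats: Sequence[Stats], first: int, last: int) -> float:
    """Pearson correlation over basic windows first..last (inclusive).

    Only the stored sums are combined; the raw points are never revisited.
    """
    n = 0
    sx = sy = sxx = syy = sxy = 0.0
    for k, a, b, aa, bb, ab in stats[first:last + 1]:
        n += k
        sx += a
        sy += b
        sxx += aa
        syy += bb
        sxy += ab
    var_x = n * sxx - sx * sx
    var_y = n * syy - sy * sy
    if var_x <= 0 or var_y <= 0:
        return float("nan")  # correlation is undefined for a constant series
    return (n * sxy - sx * sy) / math.sqrt(var_x * var_y)
```

For example, `window_correlation(basic_window_stats(x, y, 2), 1, 2)` returns the correlation of `x` and `y` over points 2 to 5. In this sketch, new real-time data only requires appending one more tuple to the list, and an all-pair correlation matrix would repeat the same combination for every pair of series. Query windows must align with basic-window boundaries here; handling arbitrary boundaries is left out.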