What can the data tell us about patterns, associations and the human factor?
As the air quality monitors in Syros complete their first year of operation, an analysis of the collected data is timely. This post summarizes the results of an exploratory forensic analysis of the data collected from one of the air quality monitors. Specifically, we looked at what information can be extracted from the emission and other data collected over an extended period, including emission patterns and their dependence on possible governing factors such as human activity and external environmental conditions. The potential for forecasting pollution levels was also investigated. The results show some correlation between emission levels and the hour and day (pointing to human activity) and a weaker correlation with the other measured environmental parameters.
Description of the data
A recent Facebook post [1] provides the probability distributions for a number of air pollutant species, constructed from the data collected at the three air-quality monitor sites operating on Syros. In that study, the data are grouped according to the four seasons and four daily intervals, and the probability of expected emission levels is then computed against season and daily interval for each species. In the present study, we wanted to go beyond probabilities and look into patterns and associations. For this purpose, we investigated the hourly data for particulate matter (PM 2.5 and PM 10) and volatile organic compound (VOC) emissions from device 8200015E, spanning the period from October 7, 2018 to July 20, 2019, a total of 6888 hours or 287 full days. In addition to these pollutants, concurrent temperature, relative humidity and ambient noise data were collected from the same device. We chose this particular monitor because of the completeness of its data. A brief description of the main results and their interpretation follows.
Fig. 1 shows the probability density of all measured quantities. The average value of each quantity is also given on its graph. These probabilities are unconditional (prior) and concern the entire observation period; they are not divided into seasons or day periods as is done in [1].
In the top three graphs, which show the PM and VOC emission levels, we added vertical dashed lines to mark normal emission limits (30 µg/m³ for PM and 200 for the VOC-AQI; these limits are typically adopted in ordinary reports). The % value shown to the right of each limit line is the probability that the emission level exceeds the limit at any time. For example, the probability of exceeding 30 µg/m³ is 1.2% for PM 10 and 0.75% for PM 2.5, and the probability of exceeding 200 points is 33.16% for VOC-AQI. Finally, we note the similarity of the PM 10 and PM 2.5 distributions (both appear to be of the Weibull type), which suggests that they likely come from a common source.
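The exceedance percentages quoted above are simply the share of hourly readings that lie above a limit. A minimal sketch of that calculation follows; the function name and the sample readings are made up for illustration and are not the Syros data, and "exceeds" is assumed to mean strictly greater than the limit.

```python
def exceedance_probability(values, limit):
    """Fraction of observations strictly above `limit` (empirical probability)."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(1 for v in values if v > limit) / len(values)


# Hypothetical hourly PM 10 readings in ug/m3 (illustrative only).
pm10_sample = [12.0, 8.5, 31.2, 15.0, 29.9, 45.1, 10.3, 9.8]
print(f"P(PM 10 > 30) = {exceedance_probability(pm10_sample, 30.0):.1%}")
```

Applied to the full hourly series of a pollutant, the same function would give the percentage printed next to each limit line.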
Categorization of combined emissions
We created four emission categories based on two variables: the value of VOC-AQI and the sum of the concentrations of the two PM species (since PM 10 and 2.5 have similar statistical properties). Thus, we have the following categories:
Table 1. Definition of the Emission Categories.
|Category name and description||Category limits|
|C1: low emissions, clean air.||VOC-AQI<200 & PM<30 µg/m³.|
|C2: high VOC and low PM emissions.||VOC-AQI≥200 & PM<30 µg/m³.|
|C3: low VOC and high PM emissions.||VOC-AQI<200 & PM≥30 µg/m³.|
|C4: high VOC and high PM emissions, very unhealthy air.||VOC-AQI≥200 & PM≥30 µg/m³.|
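The rules in Table 1 can be written as a small function. The sketch below assumes that PM means the sum of the PM 2.5 and PM 10 concentrations, as described in the text; the function and constant names are our own and only illustrative.

```python
from collections import Counter

VOC_LIMIT = 200.0  # VOC-AQI points
PM_LIMIT = 30.0    # ug/m3, applied to the sum PM 2.5 + PM 10


def emission_category(voc_aqi, pm25, pm10):
    """Return 1..4 for categories C1..C4 following Table 1."""
    high_voc = voc_aqi >= VOC_LIMIT
    high_pm = (pm25 + pm10) >= PM_LIMIT
    if high_voc and high_pm:
        return 4
    if high_pm:
        return 3
    if high_voc:
        return 2
    return 1


def category_shares(categories):
    """Relative frequency of each category in a list of hourly categories."""
    counts = Counter(categories)
    n = len(categories)
    return {c: counts.get(c, 0) / n for c in (1, 2, 3, 4)}
```

Calling `emission_category` on every hourly record and passing the result to `category_shares` gives the kind of overall category frequencies reported in this section.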
The frequency of appearance of each emission category in the observations is depicted through the tile plot in Fig. 2.
Each tile in the figure represents a VOC-AQI range on the x-axis and a PM(2.5+10) concentration range on the y-axis; the color intensity of the tile is proportional to the number of times the x and y ranges of the tile occurred in the observations. The dashed lines divide the graph into the four emission categories in Table 1. The overall frequency of the emission categories is 64% for C1, 31% for C2, 3% for C3 and 2% for C4.
Probing for associations and dependencies
In this study, we tried to look for dependence between emissions, on one hand, and some possible factors governing or affecting them such as human activity and weather conditions, on the other. Unfortunately, data about wind were not available from the monitors. Neither were data about specific human activity (such as activity of suspected pollution sources). Instead, the daily hours, week days and noise levels were used as proxies for human activity. In the following, the correlation of the emission categories defined above with the daily hours, week days, noise, and temperature, was investigated. The results are explained below.
The top left graph in Fig. 3 plots the hourly emission category computed from the recorded data using the criteria in Table 1 (the y-axis values 1 to 4 correspond to the categories C1 to C4). To obtain a baseline for ruling out random factors, we constructed a random series of values from 1 to 4, shown in the top right graph, in which the relative proportions of the four values match the category proportions in Fig. 2. We then computed the auto-covariance of both series for lags of up to 72 hours; these results are shown in the bottom two graphs in Fig. 3. As expected, the auto-covariance of the random series is an impulse at zero lag, consistent with uncorrelated white noise. The auto-covariance of the actual series, however, shows significant auto-correlation (that is, time dependence) and a periodicity of 24 hours, consistent with daily cycles. This result indicates a time pattern in the emissions that random variation alone would not produce.
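The comparison of the actual series with a random baseline can be sketched as follows. This is an illustration under our own assumptions: the series is synthetic (a fixed daily cycle, not the Syros data), the baseline is obtained by shuffling the series (one simple way to keep the category proportions unchanged), and the biased sample estimator of the auto-covariance is used.

```python
import random


def autocovariance(series, max_lag):
    """Biased sample auto-covariance for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / n
        for k in range(max_lag + 1)
    ]


def shuffled_copy(series, seed=0):
    """Random reordering that keeps the category proportions unchanged."""
    rng = random.Random(seed)
    out = list(series)
    rng.shuffle(out)
    return out


# Synthetic stand-in for the hourly category series (illustrative only):
# 30 days, C2 during hours 7..15 and C1 otherwise.
one_day = [2 if 7 <= hour <= 15 else 1 for hour in range(24)]
actual = one_day * 30
baseline = shuffled_copy(actual)

acov_actual = autocovariance(actual, 72)
acov_baseline = autocovariance(baseline, 72)
for lag in (0, 12, 24, 48, 72):
    print(lag, round(acov_actual[lag], 3), round(acov_baseline[lag], 3))
```

For the periodic series the auto-covariance returns close to its zero-lag value at lags of 24, 48 and 72 hours, while for the shuffled series it stays near zero at every non-zero lag.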
The hourly dependence of the emission categories was further investigated. The 6888 hours of observations were grouped into the 24 day hours, labeled from 0 for the first hour of the day (00:00) to 23 for the last hour of the day (23:00). Subsequently, the frequency of each emission category in each day hour was computed. The results are given in the top graph in Fig. 4, where the x-axis is the day hour, the y-axis is the emission category (1, 2, 3 or 4), and the z-axis is the emission category frequency. Casual inspection of the graph reveals a dependence of Categories 1 and 2 on the day hour: the C1 frequency decreases and the C2 frequency increases between hours 7 and 15 (3 pm). The dependence of C3 or C4 on the day hour is less obvious to unaided inspection because of the small size of these categories (their combined size is only 5% of the total observations).
We investigated the above observation more rigorously using statistical methods. It should be noted that computing the covariance between day hour and emission category in order to prove or disprove dependence will not work in general, since we do not anticipate a linear relationship between them. Instead, we computed the conditional probability of each day hour given each emission category; these probabilities are plotted as blue circles in the next four graphs in Fig. 4, one graph per category. The red x’s in the same plots mark the expected probability of each day hour if there were no dependence (that is, 1/24).
If the emission categories did not depend on the day hour, the red and blue lines in Fig. 4 would match closely. Inspection of the four plots, however, reveals not only a mismatch between the two probabilities but also a pattern in the conditional probabilities, even for the small categories C3 and C4. Of course, such a mismatch could have been generated by random chance. Therefore, we experimented with random assignments of categories (preserving the original category proportions) and compared the resulting probability quantiles. The random assignments consistently produced no patterns and, as expected, a closer fit to the red lines than the actual category distributions in the figure. This helps to exclude randomness as a cause of the patterns in Fig. 4 and, hence, strengthens our confidence that there is a dependence between day hours and emission levels.
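The conditional probabilities and the random-assignment check can be sketched as below. The data are synthetic, and the mismatch measure (total absolute deviation from 1/24) is our own choice for illustration; it stands in for the quantile comparison used in the study.

```python
import random
from collections import Counter


def hour_given_category(hours, categories, category):
    """P(hour | category) for hours 0..23, as a list of 24 probabilities."""
    selected = [h for h, c in zip(hours, categories) if c == category]
    if not selected:
        raise ValueError("category does not occur in the data")
    counts = Counter(selected)
    return [counts.get(h, 0) / len(selected) for h in range(24)]


def mismatch(probs):
    """Total absolute deviation from the uniform 1/24 expectation."""
    return sum(abs(p - 1 / 24) for p in probs)


def permutation_mismatches(hours, categories, category, n_perm=200, seed=0):
    """Mismatch values obtained after randomly reassigning the categories."""
    rng = random.Random(seed)
    shuffled = list(categories)
    results = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # keeps the category proportions
        results.append(mismatch(hour_given_category(hours, shuffled, category)))
    return results


# Synthetic example (illustrative only): 60 days, C2 during hours 7..15.
hours = [h for _ in range(60) for h in range(24)]
categories = [2 if 7 <= h <= 15 else 1 for h in hours]

observed = mismatch(hour_given_category(hours, categories, 2))
random_values = permutation_mismatches(hours, categories, 2, n_perm=50)
print(observed, max(random_values))
```

If the observed mismatch is larger than every value obtained from the random assignments, chance alone is an unlikely explanation for the hourly pattern.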
Correlation between emission categories and day hours with a particular pattern during the working hours does point to human activity. Next, we will try to strengthen this belief by comparing the emission categories to the week day.
Week day dependence
The 287 days of observations were grouped into the 7 days of the week, labeled starting with 1 for Sunday. The top left graph in Fig. 5 shows the category frequencies for each week day. The remaining seven graphs show, for each week day, the category frequency across the day hours. For example, we see that on Sunday between hours 0 and 5, C3 and C4 never appear, and that C3 appears only between hours 5 and 10 and C4 after hour 9. Another significant observation is that the hourly category patterns seen in Fig. 4, particularly those of C1 and C2, also appear on the working days Monday to Friday, but are less prevalent on Saturdays and Sundays. This last observation strengthens the hypothesis that emissions are governed by human activity.
We did not go beyond the above observations on the week days. There are many points worth investigating further in this; however, certain information that can give important clues was not available to us. The latter includes the particular location of the air-monitor device, whether it is placed in a home picking up the particular activities of the host, and so on.
Noise can be a good indicator of human activity and, in some cases, of the type of activity: for example, noise near traffic, the ambient noise of a town district, or the noise of a home. In this study, we did not know what kind of noise to expect. We followed the same approach as for the day hours in Fig. 4. The noise levels are hourly averages expressed in dBA. The noise range was divided into 10 equal-sized sub-ranges of 6 dBA each. The frequency of coincidence of the emission categories with each noise range is plotted in the top graph in Fig. 6, where the x-axis represents the noise level (dBA).
The top graph initially suggests a significant correlation. However, unlike the day hours in Fig. 4, the noise levels are not uniformly distributed in the observations, so the frequency graph may mislead us into believing there is a strong correlation. Instead, we examined the probabilities to detect any correlation. These are shown in the next four graphs in Fig. 6. Each graph shows (a) the probability of each noise range given a certain emission category, as blue circles; and (b) the prior probability of each noise range, as red x’s (compare with the probability density in Fig. 1).
We do see a significant mismatch of the two probabilities, especially for Categories 3 and 4. The average noise level for the entire data is 49.5 dBA (Fig. 1). The average noise level in C1 is 49, in C2, 51.3, in C3, 45.9, and in C4, 46.8 dBA.
To exclude random effects in the above results, we ran a quantile-to-quantile comparison for each category. The results are shown in Fig. 7. The left column of graphs shows the results for the actual categories, while the right column shows the results when random categories were assigned to the hourly data. In each graph, the x-axis represents the noise levels of all observations (unconditional distribution) and the y-axis represents the noise levels observed under a certain category. If the noise level is independent of the category, the two distributions coincide and the points fall on the line x=y; this condition is shown by the red dashed line in each graph.
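A quantile-to-quantile comparison of this kind can be sketched as follows. The quantile definition (linear interpolation between order statistics), the number of quantile points and the error measure are our own assumptions for illustration, and the function names are hypothetical.

```python
def quantiles(values, probs):
    """Empirical quantiles by linear interpolation between order statistics."""
    ordered = sorted(values)
    n = len(ordered)
    out = []
    for p in probs:
        pos = p * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        out.append(ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo]))
    return out


def qq_points(all_values, subset_values, n_points=9):
    """Pairs (quantile of all data, quantile of the subset) at equal probabilities."""
    probs = [(i + 1) / (n_points + 1) for i in range(n_points)]
    return list(zip(quantiles(all_values, probs), quantiles(subset_values, probs)))


def qq_error(points):
    """Mean absolute distance of the Q-Q points from the x = y line."""
    return sum(abs(x - y) for x, y in points) / len(points)
```

Here `all_values` would hold the noise levels of all hours and `subset_values` the noise levels of the hours in one category. Repeating the calculation with randomly assigned categories gives a reference value for how large `qq_error` can be by chance.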
Observing the left column, we see that C1 and C2 behave similarly to the random results in the right column, although the matching error is slightly greater for the actual categories. Categories C3 and C4 produced a greater mismatch that could not be reproduced with a random category assignment, as seen in the right column. The particular mismatch of C3 and C4 for noise levels over 70 dBA could possibly be explained by a high-noise process that produces C2 and masks C3 and C4.
From the above analysis, we can conclude that emissions do show correlation with noise levels that is not consistent with a random coincidence. More specifically, C3 and C4 tend to appear in less noisy hours.
A similar analysis was done to check for temperature dependency. The results are shown in Fig. 8. It is worth noting that Categories 3 and 4 avoid temperatures below 10 °C and above 27 °C, while Category 2 has the exact opposite tendency. Category 1 appears indifferent to temperature. The average temperature over the entire observation set is 17.24 °C (see Fig. 1). As we see from the graphs, the average is 17.7 °C for C1, 16.2 °C for C2, 17.8 °C for C3 and 18.1 °C for C4.
A quantile analysis and comparison with random categories was done for temperatures (similar to the one done for noise). The results (not shown here) indicate that the probability mismatches in Fig. 8 are not likely to have occurred randomly.
Clusters of daily emission patterns and their associations
In this section we investigate the similarities among the daily emission patterns. Fig. 3 shows the overall time series of the emission categories, and Fig. 9 shows a detailed example for a randomly selected day from the data: the graph has 24 hourly points forming a pattern that starts at C4 (hour 00:00), briefly drops to C3, remains mostly around C2 during the day hours, and rises to C4 in the evening.
The 6888 hourly emission values were divided into 287 days; each day is a 24-hour pattern of emissions like the one in Fig. 9. The daily averages of the temperature and the noise level were also computed. Subsequently, the 287 daily patterns were compared for similarities, and five clusters were found to divide the days best (for a quick reference on clustering, see [2]). These clusters are discussed briefly next; for each cluster, the frequency of week days, the average daily temperatures and the average daily noise levels are plotted.
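The post does not name the clustering method that was used. As one possible illustration, the sketch below groups 24-hour patterns with a plain k-means procedure and a deterministic farthest-first choice of the initial centers; both the algorithm and the function names are our assumptions, not a description of the actual analysis.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two 24-hour patterns."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_pattern(patterns):
    """Hour-by-hour average of a list of patterns."""
    n = len(patterns)
    return [sum(col) / n for col in zip(*patterns)]


def farthest_first_centers(patterns, k):
    """Start from the first pattern, then repeatedly add the farthest one."""
    centers = [list(patterns[0])]
    while len(centers) < k:
        far = max(patterns, key=lambda p: min(squared_distance(p, c) for c in centers))
        centers.append(list(far))
    return centers


def kmeans(patterns, k, n_iter=50):
    """Return (labels, centers) after at most n_iter k-means iterations."""
    centers = farthest_first_centers(patterns, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in patterns
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(patterns, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = mean_pattern(members)
    return labels, centers
```

With the real data, `patterns` would be the 287 lists of 24 hourly category values and `k` would be 5. The returned centers play the role of the average pattern of each cluster.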
Cluster 1 (67.2% of all days are in this cluster)
This is the biggest of the five clusters, including 67.2% of the days, and it has good coherency between its members. The cluster includes a majority of C1 and C2 categories, with C1 being the most frequent. The top left graph in Fig. 10 shows all daily patterns in the cluster; the black line in the graph shows the average pattern of the cluster. In this cluster the days start with low emissions (average close to C1) and, as the day progresses, especially during the day hours, emissions increase but not dramatically. The remaining three graphs help to identify the association of this cluster with week days, average daily temperatures, and average daily noise. We see that this cluster has no particular preference for any week day, and that it prefers moderate temperatures and rather quieter days.
Cluster 2 (22.3% of all days)
This is the second largest cluster, also with good coherency. It also contains mostly C1 and C2 categories. The difference from Cluster 1 in Fig. 10 is that emissions are much higher, closer to C2, during evening and night, and they abate during morning and evening. This cluster prefers Saturdays least of the week days, and favors moderate temperatures and low noise.
Clusters 3 (3.1% of days) and 4 (2.4% of days)
These are two small clusters, shown in Fig. 12. Cluster 3 has low coherency, while Cluster 4 has a greater one. Even though the data are too few to draw any general conclusions, it is worth noticing that Cluster 4 avoids Saturdays and appears only at one daily temperature, around 20 °C.
Fig. 12. Clusters 3 and 4.
Cluster 5 (4.9% of days)
This cluster is also small and has the least coherency between its members. Notably, this cluster avoids Sundays and the lower daily temperatures.
Possible predictors of the daily patterns
The previous figures reveal no particularly strong dependence between a cluster and the week day, the daily average temperature or the daily average noise level, which we had hoped would be possible predictors of the clusters. The weak dependence can also be seen in the following figure, where there is no discernible separation of the clusters in the space of week day, temperature and noise level.
The above result discourages using any linear classifier to detect or predict the daily clusters from the week day, temperature and noise level. Instead, a nonlinear classifier such as a neural network, or a state-estimation approach such as a Kalman filter, might be more appropriate. Nonetheless, as a curiosity, we attempted to construct a decision tree using the above three parameters as predictors. In our case, the classes are the 5 daily pattern clusters defined above; the predictors are the week day, the average daily temperature, and the average daily noise. Therefore, if these three parameters can be forecast for a given day, the decision tree can tell us, within a certain accuracy, which daily emission pattern to expect.
The decision tree is shown in Fig. 15. The tree is 72% accurate (it classified 72% of the test days correctly). Note that Cluster 3 cannot be detected by this tree because of its low significance in the data. To read the tree, start from the top: for example, if the daily temperature is 19.78 °C or greater, the week day is later than Monday, and the noise level is less than 59.5 dBA, then the most likely cluster is 1. Decision trees, however, should not be relied upon unless we are certain of strong associations (dependence) between the classes and the predictor parameters.
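Thresholds such as those in the tree are typically chosen by scanning candidate split points and keeping the one that leaves the purest groups. The sketch below shows this for a single predictor using the Gini impurity; the criterion, the function names and the sample values are our assumptions for illustration, since the post does not state how the tree was built.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best 'value < threshold' split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # no split at all is the starting point
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical daily temperatures (C) and cluster labels (illustrative only).
temperatures = [12.0, 14.0, 16.0, 21.0, 23.0, 25.0]
clusters = [2, 2, 2, 1, 1, 1]
print(best_split(temperatures, clusters))
```

A full tree repeats this search over all predictors and then again inside each resulting group. When the predictors are only weakly associated with the classes, as in our data, the best achievable impurity stays high and the resulting tree is unreliable.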
Conclusions and Recommendations
The study presented here is only a small step in the much greater task of analyzing and quantifying the air quality near and around the port of Syros, with the ultimate goals of (a) pointing to specific pollution sources; and (b) being able to forecast pollution levels. The analysis conducted on the data of device 8200015E showed a fair degree of correlation between the observed emission levels and the hour and day, with a tendency towards increasing emission levels during working hours and on the working days of the week. These two findings do point to human activities as a cause; however, more data from multiple sites are needed to increase our confidence that human activities are the main or a significant source of the observed levels of pollutants. We reserve this for a future study, when more data from multiple sites become available.
A weak correlation between the emissions and the hourly temperatures and noise levels was also detected. We note that a strong correlation between noise and emission levels would strengthen the hypothesis of human-caused pollution. At this point, no definite conclusions can be drawn about the connection between noise and emissions until sufficient data are analyzed, especially from multiple sites, together with a clarification of each monitor’s particular situation.
It was also discovered that emissions tend to follow certain daily patterns. These daily patterns consisting of 24 hourly values (from a category scale) were clustered into two major and three minor pattern clusters. Unfortunately, the association of these clusters with other measured data, such as average daily temperature, noise level, and week day, was weak. Therefore, no reliable forecasting of daily emission patterns can be made from these parameters alone.
Forecasting emission levels would be a significant achievement, and it is certainly realistic and doable for the Syros air monitors. The forecasting can be done for each of the pollutants separately, or the pollutants can be combined into categories similar to what we have done here. Artificial neural networks are a good candidate framework for this purpose. There is enough data to begin training an artificial neural network that could return hour-ahead forecasts or, combined with clustering, even day-ahead forecasts.
Finally, using the collected data from all the sites on Syros and trying to identify a particular source of pollution will be a far greater task, which will require more types of data. These include: (a) weather data such as wind strength and direction and precipitation data; (b) activity data of suspected sources.
[1] Tasos Matrapazis in Syros Airmon (Public Group), Facebook, 7/15/2019: https://www.facebook.com/groups/739802836418597/permalink/866202563778623?sfns=mo
[2] Google Developers, Machine learning and Clustering, https://developers.google.com/machine-learning/clustering/