Aggregate data are high-level data obtained by combining individual-level data. For instance, the output of an industry is an aggregate of the individual outputs of the firms within that industry.^[1] Aggregate data are used in statistics, data warehousing, and economics.

Aggregate data are distinct from individual data. Aggregate data are individual data that have been averaged by geographic area, by year, by service agency, or by other means.^[2] Individual data are disaggregated individual results and are used in analyses that estimate subgroup differences.^[2]

Aggregate data are mainly used by researchers and analysts, policymakers, banks, and administrators. They are used to evaluate policies, recognise trends and patterns in processes, gain relevant insights, and assess current measures for strategic planning. Aggregate data collected from various sources are used in fields of study such as comparative political analysis and meta-analysis of aggregate patient data (APD), as well as for medical and educational purposes. Although widely used, aggregate data have limitations, including the risk of drawing inaccurate inferences and false conclusions, a problem known as the ‘ecological fallacy’.^[3] The ecological fallacy is the error of drawing conclusions about the relationship between two quantitative variables at the individual level from a relationship observed at the aggregate (ecological) level.^[3]

Applications

In statistics, aggregate data are data combined from several measurements. When data are aggregated, groups of observations are replaced with summary statistics based on those observations.^[4]
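The replacement of groups of observations by summary statistics can be illustrated with a minimal Python sketch. The observations, the group labels and the `aggregate` helper are hypothetical and chosen only for illustration; the choice of count, mean and total as the summary statistics is likewise an assumption.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical individual-level observations: (group label, measured value).
observations = [
    ("north", 10.0),
    ("north", 14.0),
    ("south", 7.0),
    ("south", 9.0),
    ("south", 11.0),
]

def aggregate(rows):
    """Replace each group of observations with summary statistics."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    return {
        group: {"n": len(values), "mean": mean(values), "total": sum(values)}
        for group, values in groups.items()
    }

summary = aggregate(observations)
for group, stats in summary.items():
    print(group, stats)
```

After aggregation only one record per group remains, so the individual values can no longer be recovered from the summary.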

In a data warehouse, aggregate data dramatically reduce the time needed to query large data sets. Developers pre-summarise the results of regularly used queries, such as weekly sales across several dimensions, for example by item hierarchy or geographical hierarchy.
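A pre-summarised table of this kind might be sketched as follows, using Python's built-in `sqlite3` module and an in-memory database. The table names `sales` and `weekly_sales_by_category`, the columns and the figures are all hypothetical; a production warehouse would typically use its own materialised-view or summary-table mechanism.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical detailed fact table: one row per sale.
conn.execute(
    "CREATE TABLE sales (week INTEGER, category TEXT, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, "grocery", "east", 120.0),
        (1, "grocery", "west", 80.0),
        (1, "apparel", "east", 50.0),
        (2, "grocery", "east", 100.0),
        (2, "apparel", "west", 70.0),
    ],
)

# Pre-summarise once: weekly sales along the item (category) dimension.
conn.execute(
    "CREATE TABLE weekly_sales_by_category AS "
    "SELECT week, category, SUM(amount) AS total "
    "FROM sales GROUP BY week, category"
)

# Later queries read the small summary table instead of scanning every sale.
rows = conn.execute(
    "SELECT week, category, total FROM weekly_sales_by_category "
    "ORDER BY week, category"
).fetchall()
print(rows)
```

The trade-off is that the summary table must be refreshed whenever the detailed data change.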

In economics, aggregate data or data aggregates are high-level data composed from a combination of many more granular, individual data, such as:

in macroeconomics, data such as the overall price level or overall inflation rate; and
in microeconomics, data of an entire sector of an economy composed of many firms, or of all households in a city or region.

Major users

Researchers and analysts

Researchers use aggregate data to understand the prevalent ethos, evaluate the essence of social realities and of a social organisation, specify the primary issues of concern in research, and make projections about the nature of social issues.^[5] Aggregate data are useful to researchers who want to investigate the relationship between two distinct variables at the aggregate level, or the connection between an aggregate variable and a characteristic at the individual level.^[2] Researchers have also used aggregate data to evaluate critically the policies, practices and precepts of systems and to investigate their relevance and efficacy.^[5]

Policymakers

Governments use aggregate data to develop more effective policies because such data indicate how well a government is aware of the demands and needs of its citizens and how effectively it maintains social order.^[5] For example, governments around the world have used aggregate mobile location data for analysis in response to COVID-19. Such data could provide insights into the effectiveness of the social distancing measures launched by governments. Governments have also used aggregate data to identify possible “hot spots” and the potential for transmission.^[6]

As well as projecting the effectiveness of government policies, aggregate data analyses are undertaken to evaluate the nature, assess the extent, recognise the trend and study the pattern of a specific phenomenon or process, with the aim of devising strategies, preparing short- or long-term policies, and taking effective and relevant measures for control or prevention.^[5] Policymakers also use financial aggregates data to evaluate the economic and financial activities of companies and households, because these data help to identify risks to financial stability. Policymakers can employ aggregate data to better understand developments in a country’s economic and financial conditions.^[7]

Banks

Banks collect aggregated data from a large number of customers and then anonymise the data by removing personal information. Banks use aggregate data mainly to estimate economic trends and gain insights into customer clusters. Banks are not permitted to share customers’ personal data, but aggregate data can be shared with banks’ business customers and can be accessed by other partners who use the same platform.^[8]

In Australia, the Commonwealth Bank provides its business clients with anonymised data about their customers, derived from card transactions. ANZ also provides its business customers with anonymised data gathered from millions of merchant terminal transactions and ANZ card transactions.^[8]

In the UK, the Integrated Urgent Care Aggregate Data Collection (IUC ADC) provides comprehensive information about IUC activity, performance, and service demand. Its data are sourced from the lead data providers responsible for integrated urgent care services in England.^[9] The National Health Service (NHS), under the Department of Health and Social Care (DHSC) in England, stated that this collection of aggregate data will replace the NHS 111 minimum dataset. It will also be used as a formal source for IUC statistics and to monitor the IUC Key Performance Indicators (KPIs).^[10]

Administrators

Empirical data available at the national or regional level are used as sources of reference by administrators and intellectuals, as well as by people who are concerned about the welfare of a region or a society.^[5] In particular, administrators use aggregate data to assess the current political, religious, social, or other climate of a nation, to track gaps in social responses across time and space, and to set priorities for action. These assessments help administrators to evaluate current measures, which is useful in future strategic planning, and they provide indicators of effective corrective measures.^[5]

Sources and collection methods

Aggregate data can be compiled from various types of writings and records, including biographies, autobiographies, descriptive accounts and correspondence.^[5] For example, a researcher collects, collates, or compiles aggregate data using the instruments of social research, including an inventory, an interview, an opinionnaire, and a questionnaire or schedule. Official and non-official agencies also collect and compile aggregate data on an ongoing basis using the infrastructure available within a department at the field level.^[5]

Sources of aggregate data can also be regarded as tools for discovering data. In the US, some data are presented in the form of tables. Sources of US aggregate data include the United States Census Bureau, the Statistical Abstract of the United States, and Social Explorer. International Monetary Fund data, World DataBank, and the Penn World Table are examples of transnational and international aggregate data sources.^[11]

Use of aggregate data

Comparative political analysis

Aggregate data are used in comparative political analysis because analysts focus not only on the behaviour of individuals but also on the behaviour of areal units, including electoral constituencies and nations.^[12] In analyses of political activity, significant data such as those related to industrialisation, urbanisation, and mass communication networks are not readily expressed at the individual level. They are expressed in per capita terms in order to control for variations in the population size of the areal units.^[12] Aggregate data are widely available because demographic, socio-economic, and political data are collected and published by nations. This helps researchers and analysts to carry out studies of longer-term trends and to bring changes and developments into sharper focus.^[12]
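The effect of expressing an aggregate in per capita terms can be shown with a short Python sketch. The two areal units, their populations and the indicator `newspapers_sold` (standing in for a mass communication measure) are invented for illustration.

```python
# Hypothetical areal units with a raw aggregate indicator and population size.
areal_units = {
    "A": {"population": 2_000_000, "newspapers_sold": 500_000},
    "B": {"population": 500_000, "newspapers_sold": 200_000},
}

# Dividing by population controls for differences in the size of the units.
per_capita = {
    name: unit["newspapers_sold"] / unit["population"]
    for name, unit in areal_units.items()
}

for name, value in per_capita.items():
    print(name, value)
```

Unit A has the larger raw total, but unit B has the higher value once population size is controlled for.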

APD scientific meta-analyses

The need for time, considerable resources and wide international cooperation has impeded the use of individual patient data (IPD) meta-analysis, so most published meta-analyses rely on aggregate patient data (APD).^[13] To acquire data on all patients in all trials, aggregate patient data are collected from completed studies that have been presented at professional meetings, published in the medical literature, or supplied directly by individual investigators. Aggregate patient data are used by organisations including the Cochrane Collaboration, the United States Preventive Services Task Force, and multiple professional societies to support clinical practice guidelines. Aggregate patient data are also used in meta-analyses of time-to-event studies, as the results can inform investigators whether it is worthwhile to proceed to further meta-analyses based on resource-intensive individual patient data.^[13]

Other uses

Health care

In a health information system, aggregate data are the consolidation of data concerning numerous patients. A particular patient cannot be traced from aggregate data. These aggregated data are only counts, such as the number of cases of tuberculosis, malaria, or other diseases. Health facilities use aggregated statistics of this type to generate reports and indicators, and to undertake strategic planning in their health systems.^[14] By contrast, patient data are individual data related to a single patient, including the patient’s name, age, diagnosis and medical history. Patient-based data are mainly used to track the progress of a patient over time, such as how the patient responds to a particular treatment.^[14]
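The difference between patient data and aggregated counts can be sketched in Python. The patient records below are entirely fictitious, and the field names are assumptions made for the example.

```python
from collections import Counter

# Fictitious patient-level records, as a facility might hold them.
patients = [
    {"name": "Patient 1", "age": 34, "diagnosis": "malaria"},
    {"name": "Patient 2", "age": 51, "diagnosis": "tuberculosis"},
    {"name": "Patient 3", "age": 8, "diagnosis": "malaria"},
    {"name": "Patient 4", "age": 27, "diagnosis": "other"},
]

# Aggregation keeps only the count per diagnosis; names and ages are dropped,
# so no individual patient can be traced from the result.
case_counts = Counter(record["diagnosis"] for record in patients)

print(dict(case_counts))
```

The resulting counts are what a facility would report upwards, while the patient-level records remain the basis for tracking an individual's treatment.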

The COVID-19 Data Archive, also called COVID-ARC, aggregates data from studies around the globe. Researchers can access the discoveries of international colleagues and forge collaborations that help in fighting the disease.^[15] Specifically, aggregated healthcare data allow health care providers to unlock actionable clinical insights, for instance when comprehensive views of clinical data or continuous patient records become possible.^[15]

Education

Aggregate data such as aggregate school-level demographic data and aggregate school-level achievement data are used in experimental analysis to assess the relationships between student achievement and school-level interventions.^[16] Aggregate data can also be used in non-experimental analyses such as regression discontinuity analysis and interrupted time-series analysis, which do not require individual-level data. For example, interrupted time-series analysis estimates the impact of a school-level program by comparing a school’s achievement before and after the program is launched.^[16]
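A before-and-after comparison of this kind can be sketched in Python using only school-level yearly averages. The scores, the launch year and the `fit_line` helper are hypothetical. Projecting the pre-program linear trend forward and comparing it with observed scores is one common simplified form of interrupted time-series analysis, assumed here for illustration and not taken from the cited study.

```python
from statistics import mean

# Hypothetical school-level mean test scores by year (aggregate data only).
scores = {2015: 60.0, 2016: 61.0, 2017: 62.0, 2018: 63.0, 2019: 67.0, 2020: 68.0}
program_start = 2019  # assumed year in which the program was launched

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx

pre = {year: s for year, s in scores.items() if year < program_start}
post = {year: s for year, s in scores.items() if year >= program_start}

# Fit the trend on pre-program years and project it into the post-program years.
slope, intercept = fit_line(list(pre), list(pre.values()))
projected = {year: slope * year + intercept for year in post}

# Estimated effect: average gap between observed and projected scores.
effect = mean(post[year] - projected[year] for year in post)
print(effect)
```

A real analysis would also need to consider other events that coincide with the launch, which this sketch ignores.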

Limitations

When units are averaged within a cluster or within a country, information is lost, which increases the probability of drawing inaccurate inferences.^[17] Information is lost because aggregation treats individual variation as if it were only a type of statistical noise or measurement error.^[18] Inferences also vary depending on whether individual firm data or aggregated data are used for analysis. For instance, country averages do not account for firm-specific variables, such as firm size, firm age, or firm-ownership concentration, whereas individual-level analyses do. Results generated from aggregate data therefore differ from those generated from individual data.^[17]

There is also the problem of the ‘ecological fallacy’, a concept introduced by Robinson (1950). The term reflects the fact that the variability around the individual-level means differs significantly from the variability around the aggregate means.^[18] Aggregate measures express something other than the individual-level equivalents of the data, which means that individual-level conclusions cannot be drawn from them.^[3] Although aggregate data have wider applicability than individual-level data, it is more challenging for researchers to analyse subgroup results when aggregate data are used, and individual information may eventually be required. Growth modelling and longitudinal modelling based on aggregate data are also difficult because variables can vary over time.^[2]
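The ecological fallacy can be illustrated with a small constructed data set in Python. The three groups, their values and the `pearson` helper are invented for the example, and the numbers are chosen so that the effect is extreme.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Constructed individual-level data: (x, y) pairs for three groups.
groups = {
    "A": [(1, 3), (2, 2), (3, 1)],
    "B": [(4, 6), (5, 5), (6, 4)],
    "C": [(7, 9), (8, 8), (9, 7)],
}

# Individual level: correlation of x and y within each group.
r_within = {
    name: pearson([x for x, _ in pairs], [y for _, y in pairs])
    for name, pairs in groups.items()
}

# Aggregate level: correlation of the group means of x and y.
mean_x = [mean(x for x, _ in pairs) for pairs in groups.values()]
mean_y = [mean(y for _, y in pairs) for pairs in groups.values()]
r_aggregate = pearson(mean_x, mean_y)

print(r_within, r_aggregate)
```

In this constructed case the group means are perfectly positively correlated, while within every group the individual values are perfectly negatively correlated, so the aggregate relationship says nothing reliable about individuals.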

Other types of aggregate data

Financial aggregates data

Financial aggregates data are a type of aggregate data about credit and the money supply in Australia, used by policymakers to evaluate the economic and financial activities of both households and companies.^[7]

Credit aggregates

Credit aggregates measure the borrowings of households and businesses from financial intermediaries. They include the funds borrowed by businesses for purposes such as project investment, asset purchases, or cash flow management.^[7]

Monetary aggregates

Monetary aggregates measure the money or ‘money-like’ instruments of the banking system that are owed to businesses and households. An example of a ‘money-like’ instrument is a deposit in a bank account.^[7]

Census aggregate data

In the UK, census aggregate data are data generated as outputs from the United Kingdom censuses. They provide information about the socio-economic and demographic characteristics of the country’s population. They are a compilation of aggregated, or summarised, counts of the number of individuals, household residents, or families in particular geographic areas with specific characteristics, or combinations of characteristics, drawn from the topics of people and places, populations, families, health, ethnicity and religion, housing and work.^[19]

Aggregate data form part of the outputs of the UK censuses. They are obtained by analysing the information given in the census returns.^[19] Census aggregate data are used to describe and compare population characteristics across locations in the UK because they provide comparable information at a range of geographical levels over the entire UK. They are also used in the academic sector for teaching and research, and in the private sector for site location and marketing.^[19]

References

[1] Hashimzade, Nigar; Myles, Gareth; Black, John (2017-01-19). A Dictionary of Economics. Oxford University Press. p. 4. doi:10.1093/acref/9780198759430.001.0001. ISBN 978-0-19-875943-0.
[2] Jacob, Robin (2016). "Using Aggregate Administrative Data in Social Policy Research". Office of Planning, Research & Evaluation | ACF. pp. 1–6. Retrieved 2020-10-30.
[3] Starrin, Bengt; Hagquist, Curt; Larsson, Gerry; Svensson, Per-Gunnar (1993-06-01). "Community types, socio-economic structure and IHD mortality—A contextual analysis based on Swedish aggregate data". Social Science & Medicine. 36 (12): 1569–1578. doi:10.1016/0277-9536(93)90345-5. ISSN 0277-9536. PMID 8327920.
[4] "Aggregation and Restructuring of data". R in Action (chapter 5.6). Manning Publications.
[5] Shukla, K. S. (1982). "Analysis of Aggregate Data". Journal of the Indian Law Institute. 24 (4): 756–762. ISSN 0019-5731. JSTOR 43950840.
[6] "Mobile Location Data and Covid-19: Q&A". Human Rights Watch. 2020-05-13. Retrieved 2020-10-30.
[7] Bank, Joel; Durrani, Kassim; Hatzvi, Eden (2019-03-21). "Updates to Australia's financial aggregates". Reserve Bank of Australia.
[8] Stewart, Emily (2019-03-22). "Banks have lots of information about you — and they don't keep it all to themselves - ABC Life". ABC News. Retrieved 2020-10-30.
[9] "Statistics » Integrated Urgent Care Aggregate Data Collection (IUC ADC) Experimental Statistics 2019-20". www.england.nhs.uk. NHS England. Retrieved 2020-10-30.
[10] "Integrated Urgent Care Aggregate Data Collection (IUC ADC) for March 2020 (Experimental)". GOV.UK. England, United Kingdom. 2020-05-14. Retrieved 2020-10-30.
[11] Pencek, Bruce. "Research Guides: Data resources for social science: Aggregate data". guides.lib.vt.edu. Virginia Tech. Retrieved 2020-10-30.
[12] Retzlaff, Ralph H. (1965). "The Use of Aggregate Data in Comparative Political Analysis". The Journal of Politics. 27 (4): 797–817. doi:10.2307/2128120. ISSN 0022-3816. JSTOR 2128120. S2CID 154713056.
[13] Lyman, Gary H.; Kuderer, Nicole M. (2005-04-25). "The strengths and limitations of meta-analyses based on aggregate data". BMC Medical Research Methodology. 5 (1): 14. doi:10.1186/1471-2288-5-14. ISSN 1471-2288. PMC 1097735. PMID 15850485.
[14] "3.5 Difference between Aggregated and Patient data in a HIS". docs.dhis2.org. Retrieved 2020-11-15.
[15] Greenbaum, Zara (2020-08-19). "Scientists launch data archive to bolster research on COVID-19". HSC News. Retrieved 2020-10-31.
[16] Jacob, Robin T.; Goddard, Roger D.; Kim, Eun Sook (2014-03-01). "Assessing the Use of Aggregate Data in the Evaluation of School-Based Interventions: Implications for Evaluation Research and State Policy Regarding Public-Use Data". Educational Evaluation and Policy Analysis. 36: 44–66. doi:10.3102/0162373713485814. S2CID 145621485.
[17] Holderness, Clifford G. (2016-05-12). "Problems Using Aggregate Data to Infer Individual Behavior: Evidence from Law, Finance, and Ownership Concentration". Critical Finance Review. 5 (1): 1–40. doi:10.1561/104.00000028.
[18] Pollet, Thomas V.; Stulp, Gert; Henzi, S. Peter; Barrett, Louise (2015). "Taking the aggravation out of data aggregation: A conceptual guide to dealing with statistical issues related to the pooling of individual-level observational data". American Journal of Primatology. 77 (7): 727–740. doi:10.1002/ajp.22405. ISSN 1098-2345. PMID 25810242. S2CID 1705139.
[19] "Census aggregate data guide". census.ukdataservice.ac.uk. Retrieved 2020-10-31.
