Representation of Data at Scale

Data representation can be defined as the selection of a mathematical structure that models data in a particular fashion. The choice of structure (i.e., how the data will be modeled) is shaped by many factors, including hardware, communication, input/output, and noise.

Furthermore, the character of a data representation differs depending on a variety of factors, such as the task at hand, the conditions under which the data are obtained, the aspect of data analysis being considered, and which facets of the data are being acquired.

Examining the Numerous Uses of Data Representation

As noted, at an aggregate level, data representation is intricate. Thus, for consistency and organization, it is often necessary to distinguish among the several uses of the term.

Basic data structures refer to structures such as hash tables and inverted indices. Generally, the term covers the fundamental structures found in textbooks, particularly those on algorithms or databases.
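A minimal sketch of one such textbook structure, an inverted index built on a hash table; the documents here are illustrative placeholders:

```python
# Build an inverted index: map each term to the set of document IDs
# that contain it. A dict (hash table) gives constant-time term lookup.
from collections import defaultdict

docs = {
    0: "the quick brown fox",
    1: "the lazy dog",
    2: "quick brown dogs run",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Lookup is a single hash-table probe per term.
print(sorted(index["quick"]))  # → [0, 2]
```

The same pattern underlies full-text search at much larger scales, where the index is sharded across machines.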

More abstract, but still relatively basic, mathematical structures include sets, vectors, metric spaces, and graphs. Such structures differ from basic data structures in their multi-faceted nature: a single dataset may admit several of these representations at once, each supporting different operations and metrics.
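To illustrate the multi-structural point, here is one small dataset viewed two ways: as vectors in a metric space (with Euclidean distance) and as a similarity graph derived from that metric. The points and threshold are illustrative:

```python
# One dataset, two representations: points as vectors with a Euclidean
# metric, and as a graph whose edges connect nearby points.
import math

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Graph view: connect pairs of points closer than a threshold.
threshold = 2.0
edges = [(i, j)
         for i in range(len(points))
         for j in range(i + 1, len(points))
         if dist(points[i], points[j]) < threshold]

print(edges)  # → [(0, 1)]: only the two nearby points are connected
```

The vector view supports arithmetic and distance queries; the graph view supports connectivity and neighborhood queries over the very same data.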

Derived mathematical structures refer to more advanced structures, such as clusters, linear projections, and data samples. These are often themselves basic mathematical structures, but they differ in that they are not direct representations of the native data. Instead, derived structures are obtained from the data and can be modeled as representational components.
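As a small sketch of a derived structure, the following projects 2-D points onto a single direction, producing a compact 1-D representation obtained from, rather than identical to, the native data. The points and projection axis are illustrative:

```python
# A derived representation: project 2-D points onto one direction,
# yielding a 1-D summary of the native data.
import math

points = [(2.0, 1.0), (4.0, 2.0), (6.0, 3.0)]
direction = (2.0, 1.0)  # illustrative projection axis

norm = math.hypot(*direction)
unit = (direction[0] / norm, direction[1] / norm)

# Each 2-D point is now represented by a single coordinate.
projected = [p[0] * unit[0] + p[1] * unit[1] for p in points]
print(projected)
```

Methods such as principal component analysis choose the projection direction from the data itself rather than fixing it in advance, as done here.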

General Goals of Data Representation

The main objective in representing data is to provide meaningful data or statistics that can be used efficiently in the task at hand. A quality representation therefore pays dividends repeatedly, allowing a great deal of data processing and analysis to be completed in an efficient manner.

Challenges of Data Representation

Complex objectives often bring complex challenges. While many factors contribute to proper data representation, some of the most important include:

Reducing Computation

Because of the massive amounts of data involved, it is essential that computations operate as efficiently as possible. To do this, one must understand which structures suit which objectives. Such understanding comes from exercising various operations on the data, such as queries and sequential processing algorithms.
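A minimal illustration of matching structure to objective: a membership query against a list scans the whole collection, while the same query against a hash set is answered in expected constant time.

```python
# The same data, two structures, very different query costs.
data = list(range(100_000))

as_list = data       # membership test: O(n) scan per query
as_set = set(data)   # membership test: O(1) expected per query

print(99_999 in as_list)  # → True, but scans ~100k elements
print(99_999 in as_set)   # → True, via a single hash probe
```

For a handful of queries the difference is negligible; over millions of queries against massive data, choosing the wrong structure dominates the running time.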

Reducing Storage and/or Communication

To represent data well, storage and communication should be kept to a minimum. Storing data at scale requires communication across networks, and, much as in language, communication that is not direct and concise invites misinterpretation. Reducing storage and communication leaves only the data that are essential to the objective: nothing more, nothing less.

Reducing Statistical Complexity

Statistical complexity is defined in Frontiers in Massive Data Analysis as “the amount of data needed to solve a given statistical task with a given level of confidence.” Being deliberate in the selection of data attributes and parameters therefore naturally improves the quality of statistical inference. Such an approach often reduces storage and running time as well.
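One concrete instance of the data-versus-confidence trade-off: for estimating a mean, the standard error falls as 1/sqrt(n), so halving the uncertainty requires roughly four times as much data. The population standard deviation below is illustrative:

```python
# Standard error of a sample mean: sigma / sqrt(n).
import math

sigma = 2.0  # assumed population standard deviation (illustrative)

def standard_error(n):
    """Standard error of the sample mean for a sample of size n."""
    return sigma / math.sqrt(n)

# Quadrupling the sample size halves the error:
print(standard_error(100))  # → 0.2
print(standard_error(400))  # → 0.1
```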

Exploratory Data Interpretation

Exploratory data interpretation generally refers to initial examinations of the data carried out to gain enough understanding to formulate a proper hypothesis. Because this task is part of the initial process, very simple statistics may be all that is needed to generate an appropriate hypothesis. Exploratory interpretation also builds a more fundamental understanding of the data, providing clues about how the data should be analyzed going forward, or whether further inspection is necessary at all.
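The “very simple statistics” in question can be as plain as the following first-pass summary; the values are illustrative:

```python
# A quick first-pass summary, often enough to form an initial hypothesis.
import statistics

values = [3, 7, 7, 9, 12, 14, 21]

summary = {
    "n": len(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": statistics.stdev(values),
}
print(summary)
```

If, say, the mean and median diverge sharply, that alone suggests skew worth investigating before any heavier analysis.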

Data Visualization

Because of how the brain processes information, it is often easier for individuals to derive meaning from data presented and organized in a visual format. Yet as the amount of data we create and analyze continues to grow, it becomes progressively more difficult to condense and present data in traditional formats. To cope, data are increasingly presented in new, more innovative formats such as 3D graphs and interactive maps.


Sampling

In dealing with data at massive scale, sampling plays such a key role that questions about Big Data often come down to how the data should be sampled. There are essentially two basic approaches. The first assumes the data have already been collected and is concerned with computing over them. The second deals with the collection of the data itself, and is used primarily when the data are so large that collecting and analyzing all of them is unrealistic.
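A classic sketch of the second approach is reservoir sampling, which maintains a uniform fixed-size sample from a stream too large to store in full; the stream and seed here are illustrative:

```python
# Reservoir sampling (Algorithm R): keep a uniform random sample of
# size k from a stream of unknown length, in one pass and O(k) memory.
import random

def reservoir_sample(stream, k, rng=None):
    rng = rng or random.Random(0)  # seeded here for reproducibility
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Replace an existing entry with decreasing probability k/(i+1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), 5)
print(sample)  # five items drawn uniformly from the stream
```

Because it never needs the stream's length in advance, the same routine applies to data arriving continuously, precisely the setting where full collection is unrealistic.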

The Future of Data Representation

The data industry has grown exponentially, and experts project that this growth will continue through 2020. As a result, data representation will play a crucial role in almost every sector of society. Models and structures built on the principle of reduction, for clarity and efficiency, will be integral to putting data to use. Given the projected boom, it is also important to observe data closely in order to identify improved methods of representation.