In March 2021, the Board of Directors of Universidad del Valle de Guatemala (UVG) authorized the creation of the Data Science Lab as a unit within the Center for Studies in Applied Informatics (CEIA in Spanish). One of the goals of this new unit is to collect, store and preserve as much of the data generated in Guatemala as possible. This information will be used to create an open data repository, meaning that the data will be available without restrictions, on the condition that the source is cited and the data is shared.

The new Data Science Lab is an innovative proposal because, in addition to working with small databases, it will be one of the pioneers in the country in working with Big Data and in making tools available for Guatemalans to work with it.

What is Big Data?

The term Big Data refers to large and complex data sets that are so voluminous that working with them requires greater computational resources than conventional tools provide. Despite the term's popularity, there is still no consensus on its definition, so we can simplify it by saying that if the resources available are not enough to process the data, it is Big Data.

Big Data is characteristic of the 21st century: an estimated 1.7 megabytes of data are currently generated per second for every person in the world. Despite their complexity, these massive volumes of data can be used to identify and find solutions to problems that were once unsolvable. In other words, Big Data provides a point of reference.
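To give a sense of scale, the estimate above can be turned into a daily figure. The following is only a back-of-the-envelope sketch: it assumes decimal units (1 GB = 1,000 MB, 1 PB = 1,000,000 GB), and the function names and the population of one million used in the example are illustrative choices, not figures from the Data Science Lab.

```python
# Rough scale check for the "1.7 MB per second per person" estimate.
MB_PER_SECOND_PER_PERSON = 1.7   # figure quoted in the text
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds


def daily_volume_gb(mb_per_second: float = MB_PER_SECOND_PER_PERSON) -> float:
    """Gigabytes generated per person per day (assumes 1 GB = 1,000 MB)."""
    return mb_per_second * SECONDS_PER_DAY / 1000


def total_daily_volume_pb(population: int,
                          mb_per_second: float = MB_PER_SECOND_PER_PERSON) -> float:
    """Petabytes generated per day by a population (assumes 1 PB = 1,000,000 GB)."""
    return daily_volume_gb(mb_per_second) * population / 1_000_000


if __name__ == "__main__":
    # About 146.88 GB per person per day.
    print(f"{daily_volume_gb():.2f} GB per person per day")
    # Hypothetical population of one million people, chosen only for illustration.
    print(f"{total_daily_volume_pb(1_000_000):.2f} PB per day for one million people")
```

At that rate, one person accounts for roughly 147 GB per day, which illustrates why ordinary storage and processing resources are quickly exceeded.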

The “three Vs” of Big Data

The idea of the “three Vs” describes the characteristics that Big Data needs in order to be relevant. These are:

Volume: The amount of data we have matters because it dictates the resources needed to model and process it.

Velocity: This characteristic refers to the rate at which data is received and acted upon. Sometimes this data is acquired in real time, which requires evaluation and action at the same speed.

Variety: This refers to the different types of data available. In the past, conventional data was structured and could be clearly organized in a relational database. As the amount of data has grown, it increasingly includes elements such as text, audio, or video, which are more difficult to structure.

Two other Vs are frequently mentioned: value and veracity. Data has intrinsic value, but it is of no use until that value is discovered. To be of any value, the data must be usable, and this depends on its preservation.

It is equally important to ensure that the data comes from reputable and reliable sources. For this reason, CEIA intends to fill a void in the country by creating the Data Science Lab as a point of reference where researchers and companies from various sectors can obtain relevant data.

The Data Science Lab projects

The Data Science Lab will begin its operations with two projects, which will be available to the public:

  • Data Lake: A centralized repository that allows storing structured and unstructured data at any scale. Data can be stored as is, without first having to structure it. The repository will be accompanied by metadata management tools and an interactive graphing tool.
  • Guatemala in Data: This project is being developed in conjunction with the Sustainable Economic Observatory (OES in Spanish) of the UVG and seeks to be a reference data platform for Guatemala. It will make it possible to find the most relevant data available on the national reality. The platform will be freely accessible, its data will be free to download, its content will be contributed collaboratively, and it will aim for the highest rigor and impartiality.
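The idea of storing data "as is" alongside its metadata can be illustrated with a small sketch. This is not the Data Science Lab's implementation: the `ingest` function, the `raw` folder, the `.meta.json` sidecar file, and the metadata fields are all hypothetical choices made for illustration, using only the Python standard library.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def ingest(source: Path, lake_root: Path, description: str) -> Path:
    """Copy a file into the lake unchanged and write a metadata sidecar next to it.

    Hypothetical layout: <lake_root>/raw/<filename> plus <filename>.meta.json.
    """
    raw_dir = lake_root / "raw"
    raw_dir.mkdir(parents=True, exist_ok=True)

    # The bytes are stored as is: no parsing, no schema, no restructuring.
    target = raw_dir / source.name
    shutil.copy2(source, target)

    # Descriptive metadata is kept separately so the data can be found later.
    metadata = {
        "filename": source.name,
        "size_bytes": target.stat().st_size,
        "sha256": hashlib.sha256(target.read_bytes()).hexdigest(),
        "description": description,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    sidecar = raw_dir / (source.name + ".meta.json")
    sidecar.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return sidecar
```

Because the file is copied without being interpreted, the same function works for a spreadsheet, a text document, an audio recording, or a video; only the metadata record tells users what each file contains.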

With the creation of the Data Science Lab, the CEIA aims to support the conservation and maintenance of valuable data, which in the long term will enable more research that contributes to the development of the country. This data laboratory and the information it will hold are key tools for understanding our reality and acting on it.