How to collaborate with small data
Updated: Oct. 19, 2021
"Big Data" might be the greatest buzzword of all time. Opinions on what it is, what it isn't, where it starts and where it ends depend entirely on whom you talk to. Regardless of definition, the global big data analytics market is projected to grow at a compound annual growth rate of almost 30 percent over the next few years, with revenue reaching over 68 billion U.S. dollars by 2025, up from around 15 billion U.S. dollars in 2019. While big data continues to claim much of the buzz, small data is a hidden gem that is often forgotten in conversations about data challenges and potential. Thanks to recent technological advancements, more people than ever are able to collaborate around a distributed ecosystem of information, an ecosystem of small data. So as not to add yet another buzzword to the mix, let's briefly clarify what we mean by the term "data collaboration".
What is data collaboration?
Data collaboration, also referred to as data conferencing, involves two or more participants working to uncover business insights or data context within one data set or across several. Common reasons for collaborating on data are to find connections, relationships, correlations, or root causes reflected in the data source. Data collaboration works with structured and unstructured data, and with small as well as big data. The technical focus and the methods used determine which collaboration tools should be put to use. Depending on the actual requirement, data collaboration can take place anywhere along the skill spectrum. At the extremes it involves technical staff or business staff exclusively, but it most frequently comes to fruition when technical and functional personnel work together. Likewise, collaboration can happen at different levels of granularity, ranging from code level up to discussions centered only on plots and charts.
From this definition, we can broadly distinguish four different types of collaboration.
Excel Jungle Data Collaboration
Business Experts building reports in Excel or other low-code file formats based on static excerpts of source data. Collaboration and knowledge exchange are needed to keep track of the logic and to clarify questions about provenance and quality.
Technical Data Collaboration
BI Experts, Data Engineers and Data Scientists discussing data pipelines and related challenges on a very technical level.
Dashboard Process Data Collaboration
Business Experts sharing data insights in a process-driven and dashboard-centric way.
Exploratory Data Collaboration
Data Scientists searching through data samples for insights, correlations, and dependencies prior to committing to concrete next steps concerning data requirements and modelling.
Our take-away: Data collaboration comes in many shapes and forms. No matter what, having healthy data collaboration routines in place leaves companies better off.
Benefits and challenges of data collaboration
For organizations, regular and intensive data collaboration significantly increases the value of data. Insights that would go unnoticed if the data were processed in isolation are uncovered, whether deliberately or by chance. This can significantly increase a company's competitive advantage. 49% of companies report insufficient data quality or a lack of contextualization of their enterprise data. Without contextualisation, data remains just ... data. By collaborating well on data, these grievances can be uncovered and remedied much more quickly and with significantly less frustration. Yet successful data collaboration is easier said than done. In fact, 43% of all companies see cross-company understanding as the biggest hurdle to adopting data projects. Establishing good data collaboration practices remains an item on the bucket list of many companies.
Small Data vs. Big Data
Well, I either have a lot of data or just a few rows, right?! That's a very casual way of looking at things. Reality is a bit more colourful. The difference between Small Data and Big Data manifests along various dimensions.
Big Data usually comes into the picture when we talk about terabytes of data. But for many of us, even a 100 GB dataset is out of reach without access to the right technologies. In practice, it is often hard to draw a clear line between what's big and what's small. As a rule of thumb, anything around 10 GB is usually not accessible to the average person without a struggle or without the right technology and skills. The underlying tables or spreadsheets can also be large and unwieldy, which prevents users from digesting all the information easily without further processing and visualization. Still, a 10 GB dataset does not really fit the usual “Big Data” definition and remains accessible enough that a non-specialist could work with it using tools like an Excel on steroids, or query it through a SQLite database. Our suggestion: let’s call the small data projects of today "small big data" projects. Data volumes like these are the everyday reality for many people who work with data.
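To make the SQLite route mentioned above a little more concrete, here is a minimal sketch using only Python's standard library. It streams a CSV file into a SQLite table in batches, so the whole file never has to fit in memory, and then aggregates with plain SQL. The table name `readings`, the column names and the tiny in-memory sample are assumptions made for illustration; a real project would open a file on disk and a database file instead.

```python
import csv
import io
import sqlite3


def load_csv(conn, csv_file, table, batch_size=10_000):
    """Stream rows from an open CSV file into a SQLite table in batches."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
    insert_sql = f'INSERT INTO "{table}" VALUES ({placeholders})'

    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany(insert_sql, batch)
            batch.clear()
    if batch:  # write whatever is left after the last full batch
        conn.executemany(insert_sql, batch)
    conn.commit()


# Hypothetical sample data standing in for a multi-gigabyte CSV file.
sample = io.StringIO("machine,temperature\nA,71.5\nA,72.5\nB,65.0\n")

conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
load_csv(conn, sample, "readings")

# Values arrive as text, so cast explicitly before aggregating.
averages = conn.execute(
    "SELECT machine, AVG(CAST(temperature AS REAL)) "
    "FROM readings GROUP BY machine ORDER BY machine"
).fetchall()
print(averages)
```

Once the data sits in SQLite, colleagues with basic SQL skills can explore it without loading a spreadsheet that is too large to open comfortably.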
A rocky path to value - Small Data Challenges
“If you take the top 100 biggest innovations of our time, perhaps around 60% to 65% are really based on Small Data.” - Martin Lindstrom
Believe it or not, small data has pulled the strings behind many innovations. When we read about the latest and greatest breakthroughs using big data, we can rest assured that most of this pioneering work started with extensive trial and error and adventurous explorations of smaller datasets. While many data challenges emerge regardless of size, some issues weigh even more heavily on small data and therefore deserve considerably more attention.
Representation and variability
Small data sets only reflect reality to a limited extent. In statistics, we speak of samples, and the smaller a sample, the greater the probability that it looks the way it does purely by chance. A well-known example of this phenomenon is a coin. The probabilities for heads and tails are 50% each. Let's pretend we don't know that. If I want to use data to determine the probability, I need to flip the coin and record the results. Suppose I flip the coin three times and get heads twice and tails once. My data would then tell me that the probability is 2/3 for heads and 1/3 for tails. This does not correspond to reality and is due to the fact that I only have a small sample, which was generated by chance. If I flip the coin 30 times, however, I will usually get much closer to the expected 50% per side. Small data should therefore always be processed and consumed with a certain caution before drawing conclusions that are meant to be generally valid.
Small data may be subject to biased recording. Suppose we want to record the temperature variations of a machine that runs on a day shift and a night shift. To record its performance, we send an intern to the machine three times a day to read the gauges: one check in the morning, a second in the afternoon and a final look in the evening just before he leaves. This data is of limited help for a comprehensive analysis because we only record at certain, almost fixed times within a range from 9am to 5pm. The night shift is not covered at all, as the intern is only there during the day. If a different part is produced on the machine at night, so that it runs in a different configuration, this is not reflected in the data, and the data we want to analyze is skewed in favor of a specific time of day.
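The coin example can be put into numbers. The following sketch is our own illustration, not part of any cited study: it computes the standard error of the observed share of heads and the exact probability that this share lands within ten percentage points of the true 50%. The function names and the ten-point tolerance are assumptions chosen for the example.

```python
from math import comb, sqrt


def standard_error(n, p=0.5):
    """Standard error of the observed share of heads after n flips."""
    return sqrt(p * (1 - p) / n)


def prob_share_within(n, tolerance=0.1, p=0.5):
    """Exact binomial probability that the observed share of heads
    lies within `tolerance` of the true probability p after n flips."""
    eps = 1e-9  # guards against floating-point noise at the boundary
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if abs(k / n - p) <= tolerance + eps
    )


for n in (3, 30, 300):
    print(n, round(standard_error(n), 3), round(prob_share_within(n), 3))
```

With three flips, the observed share can only be 0, 1/3, 2/3 or 1, so it can never land within ten percentage points of 50%. With 30 flips it does so in roughly four out of five experiments, which is better but still far from certain.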
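A quick way to make this kind of recording bias visible is to check which hours of the day the readings actually cover. The sketch below assumes the readings carry timestamps; the concrete dates and times are hypothetical and merely mirror the three daily checks from the example.

```python
from datetime import datetime


def uncovered_hours(timestamps):
    """Return the hours of the day (0-23) for which no reading exists."""
    covered = {ts.hour for ts in timestamps}
    return [hour for hour in range(24) if hour not in covered]


# Hypothetical readings: morning, afternoon and evening checks on two days.
readings = [
    datetime(2021, 10, 18, 9, 0),
    datetime(2021, 10, 18, 13, 0),
    datetime(2021, 10, 18, 17, 0),
    datetime(2021, 10, 19, 9, 5),
    datetime(2021, 10, 19, 13, 10),
    datetime(2021, 10, 19, 17, 0),
]

gaps = uncovered_hours(readings)
print(f"{len(gaps)} of 24 hours have no reading: {gaps}")
```

A list of gaps like this is a good conversation starter with the person who collected the data: it shows at a glance that the night shift is missing entirely.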
Recording Errors and Data Quality
Small datasets are often (though by no means always) created manually or semi-manually. This frequently results in spelling errors (good -> god), value errors (1 -> !) or free-text fields. All of these reduce data quality significantly, and because there is so little data to begin with, they can render the dataset unusable for meaningful evaluation and processing.
What all these challenges have in common is that they demand collaboration and communication to alleviate their impact. Concerns about representation and variability can only be soothed if we have a word with the person who generated the data or is in the know about its provenance. Similarly, only a word with the intern in charge of reading the gauges will allow us to estimate the size of the bias that has crept into the data at hand.
Lastly, spotting data quality issues is often easier than coming up with a sensible way of correcting them. Talking to a business owner who can explain how "healthy" data should be distributed or what it should look like is often more advisable than compromising valuable patterns through naive arithmetic imputation. Here are some tips we have found very useful when collaborating on small data in an organisation.
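Errors like the two mentioned above can often be flagged automatically before anyone sits down to discuss them. The sketch below is one possible approach, not a complete validation framework: it assumes a hypothetical schema with a `status` column that may only contain a fixed vocabulary and a `count` column that must be numeric, and it uses the standard library's `difflib.get_close_matches` to suggest the likely intended value for a misspelling.

```python
from difflib import get_close_matches

# Hypothetical vocabulary for the "status" column.
ALLOWED_STATUS = ["good", "bad"]


def find_issues(row):
    """Return a list of human-readable problems found in one record."""
    issues = []

    status = row["status"].strip().lower()
    if status not in ALLOWED_STATUS:
        suggestion = get_close_matches(status, ALLOWED_STATUS, n=1)
        hint = f", did you mean '{suggestion[0]}'?" if suggestion else ""
        issues.append(f"unknown status '{status}'{hint}")

    try:
        float(row["count"])
    except ValueError:
        issues.append(f"count '{row['count']}' is not a number")

    return issues


rows = [
    {"status": "good", "count": "1"},
    {"status": "god", "count": "!"},  # spelling error and value error
]

report = {index: find_issues(row) for index, row in enumerate(rows)}
for index, issues in report.items():
    print(index, issues)
```

The check only flags suspicious values and proposes a correction. Whether "god" really meant "good" is still a question for the person who entered the data.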
Methods to collaborate on small data
Heterogeneity of participants: As obvious as it sounds, when talking about data it is extremely valuable to invite people from different backgrounds who have a different perspective on the data than you do. If you are an expert in the core business, it is valuable to get other process participants and people from IT on board. Especially for small data sets, analysts and statisticians are also a great asset. These people don't have to be deeply familiar with the subject, and a quick look at the data is sometimes enough for them to raise interesting questions that will get you further.
Data Canvas before recording: If you have the chance, create a project-specific data canvas with your team before you start collecting, at scale, the data your analysis will rely on. A canvas will help you gain an overview of which data is valuable, what exactly the data should show, and in what format it needs to be collected. The Data Project Canvas by Daan Kolkman is a good choice for this, as is the ML Canvas by OWNML. There are different versions, and it matters less which one you adopt than that you pick one to clear the fog ahead of you before you start running.
Data review: Since abnormal data values affect small data even more than big data, spend time reviewing, cleaning, and managing the small data asset. This means detecting outliers, imputing missing values or deciding how to handle them, and understanding the impact of measurement errors.
Data Process Map: Domain expertise might be the best way to master big and small datasets alike. Since small data can suffer from serious problems with bias and representation, you should draw on prior experience and domain expertise. A great way to do that is a data process map. Invite domain experts to explain how the business process works. By drawing the process, you can map your columns and data records to its different stages, which allows you to see the big picture and to ask the experts questions about the data. Knowledge is often isolated in people's heads, but by drawing the big picture and pinning your data next to it, you will find more insights than you might have expected.
Don’t make it formal: Use creative methods and whiteboard structures to connect the dots. This makes it more fun for all participants and allows everybody to take part, feel comfortable and think outside the box.
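For the data review tip above, a first pass can be automated with a few lines of code. The sketch below counts missing values and flags outliers using the modified z-score based on the median absolute deviation (MAD), a common heuristic that is less distorted by a single extreme value than mean and standard deviation. The threshold of 3.5 is a widely used convention, not a rule, and the temperature readings are made up for illustration.

```python
from statistics import median


def review_column(values, threshold=3.5):
    """Count missing entries and flag outliers via the MAD-based modified z-score."""
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)

    center = median(present)
    mad = median([abs(v - center) for v in present])

    if mad == 0:
        # More than half of the values are identical; the score is undefined.
        outliers = []
    else:
        outliers = [
            v for v in present
            if abs(0.6745 * (v - center) / mad) > threshold
        ]
    return {"missing": missing, "median": center, "outliers": outliers}


# Hypothetical temperature readings with one gap and one suspicious value.
temperatures = [20.1, 19.8, None, 20.3, 20.0, 35.0, 19.9]
summary = review_column(temperatures)
print(summary)
```

A flagged value is a prompt for a conversation, not an instruction to delete it: the reading of 35.0 might be a typo, or it might be the one moment the machine overheated.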
But which data collaboration tools are worth exploring? Data collaboration becomes especially hard when people with different profiles and skill sets need to collaborate on data live. Well, who would we be if we didn't think detective is the best option you have. It enables state-of-the-art real-time collaboration while providing full access to your data, no matter whether it is big or small or what format it is in. With our canvas-powered user interface, collaborative methods such as the data process map can be created with ease. Above all, everything happens in one place. Excited to learn more about data collaboration and the detective platform? Sign up for our newsletter or book a demo session with us right away.
Liu, S. (2021) - Big data analytics market revenue worldwide in 2019 and 2025 - https://www.statista.com/statistics/947745/worldwide-total-data-market-revenue/
Seyfert, S., Schlömer, L., Schiborr, L. A., Dr. Bange, C., & Krüger, T. (2018) - biMA Studie 2017/18. Hamburg: Sopra Steria SE - https://www.soprasteria.de/docs/librariesprovider2/sopra-steria-de/infografiken/infografik-studie-bima.pdf?sfvrsn=3d155fdc_6
Statista (2020) - Biggest challenges to big data adoption among corporations in the United States and worldwide, as of 2019 - https://www.statista.com/statistics/742983/worldwide-survey-corporate-big-data-adoption-barriers/
https://knowledge.wharton.upenn.edu/article/small-data-new-big-data/
https://www.researchgate.net/figure/The-original-Dutch-version-of-the-Data-project-canvas_fig1_331373918
https://www.ownml.co/machine-learning-canvas