5 Principles to Know

7 mins

Data is information that can be used as the basis for analysis. Sometimes data is easy for a human to understand in its raw form (for example, the text in documents could be considered data in the context of certain analysis work, and humans would typically find this easier to process than a computer algorithm). Other times, data analysis, exploration and presentation techniques are required before the information (i.e. the meaning) within the data becomes accessible to a human.

Data can be stored in varying degrees of ‘structure’, which determines how easily it can be understood by a computer or algorithm. Structured data is information that is highly ordered, typically tabular, can be easily ‘read’ by a computer and exists in predefined formats (often within a database). Semi-structured data is information that is not stored in a tightly defined format but has some level of organisation and standardisation (e.g. tagged images or documents).

Unstructured data usually describes information in its native form, namely how it appears in the real world. It has not been abstracted or standardised in a predefined way. Whilst it is often the most straightforward form for a human to consume (and represents a large amount of valuable business data), unstructured data needs processing to become machine interpretable. To achieve this, a data scientist will often work to translate various forms of data into a higher degree of structure.
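To make this concrete, the sketch below shows one common way of adding structure to unstructured text: a regular expression pulls a handful of fields out of free-form contract wording and stores them as a tabular record. The wording, pattern and field names are all illustrative assumptions rather than a general-purpose parser.

```python
# A minimal sketch of translating unstructured text into structured data.
# The contract sentence and field names are invented for illustration.
import re

raw_text = "This Agreement is made on 2023-05-01 between Alpha Ltd and Beta LLP."

# Named groups give the extracted fields labels, turning free text
# into a key-value (structured) record.
pattern = re.compile(
    r"made on (?P<date>\d{4}-\d{2}-\d{2}) between (?P<party_a>.+?) and (?P<party_b>.+?)\."
)

match = pattern.search(raw_text)
if match:
    record = match.groupdict()
    print(record)  # {'date': '2023-05-01', 'party_a': 'Alpha Ltd', 'party_b': 'Beta LLP'}
```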

Data science combines multiple disciplines to extract insights from data in all its forms. It draws on techniques from maths, statistics, computer and information science (amongst others), to provide businesses with accessible and actionable knowledge. It is a highly skilled function or, for larger organisations, multiple functions (e.g. software engineers, data engineers, data analysts and data scientists) that requires technical expertise together with strong communication and business skills.

Data science can be used to analyse available data to help with a range of issues:

  • classification algorithms can be used to choose between two or more options (e.g. whether an email is spam or not; see the sketch after this list).
  • anomaly detection algorithms can be used to identify outliers such as unusual actions, events or behaviours (e.g. to detect unusual trades).
  • clustering algorithms help to explain the structure of a data set by looking for similarities between data (e.g. documents with or without indemnities).
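As a deliberately tiny illustration of the classification case, the sketch below trains a spam classifier. The use of scikit-learn and the hand-made data set are assumptions; the text does not prescribe any particular tooling.

```python
# A minimal spam-classification sketch with scikit-learn.
# Four invented emails stand in for a real labelled data set.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

emails = [
    "win a free prize now",
    "claim your free money",
    "meeting agenda attached",
    "quarterly billing report",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

# Convert text into word-count features the model can read.
vectoriser = CountVectorizer()
X = vectoriser.fit_transform(emails)

model = LogisticRegression().fit(X, labels)
print(model.predict(vectoriser.transform(["free prize money"])))  # likely [1]
```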

Data science can also be used to predict outcomes:

  • regression models help to make numerical predictions (e.g. likely billings for the next quarter or the fee for a given matter; see the sketch after this list).
  • attention-based transformer models can be applied to any type of sequential data (e.g. machine translation of text, chat-bots, image object detection).
  • reinforcement learning models can teach themselves to make decisions based on a predefined environment and goal (e.g. self-driving cars).
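The sketch below illustrates the regression case: fitting a straight line to four quarters of invented billing figures and predicting the next quarter. The numbers and the use of scikit-learn are illustrative assumptions only.

```python
# A minimal regression sketch: predicting next quarter's billings.
import numpy as np
from sklearn.linear_model import LinearRegression

quarters = np.array([[1], [2], [3], [4]])          # quarter index
billings = np.array([100.0, 110.0, 125.0, 130.0])  # invented figures (e.g. £k)

model = LinearRegression().fit(quarters, billings)
print(model.predict([[5]]))  # estimated billings for quarter 5
```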

Defining the right problem statement is arguably one of the most important steps of a data science project. It determines the scope of the project, the techniques and technologies to be used (Principle 2), the data that is required and, ultimately, the business value that the project delivers. Spending sufficient time, with the right stakeholders, on this stage of the process is therefore crucial.

A problem statement primarily needs to be capable of delivering the business needs of the project. Formulating the right question is therefore an iterative and collaborative process. It starts with understanding an organisation’s objectives and continues until all stakeholders have a clear understanding of the commercial problem and intended outcomes.

Once the business objectives have been established, the problem statement can then be formulated in a way that can be answered by data science. The questions need to be clear, specific and conclusive, rather than broad and ambiguous. This will then allow the data scientist to determine the appropriate techniques to deploy to meet the organisation’s objectives and the type of data that will be needed.

Data science projects depend on data sets that are often very large. Commonly, the bigger the data set, the more reliable the output. However, data quantity is not the only (or necessarily the primary) concern. Rather, it is data quality that is critical (DAMA UK provides detailed information on data quality dimensions). Any output is only as reliable as its input (the so-called ‘garbage in, garbage out’ or ‘GIGO’ problem). In sum, poor data quality risks poor decision-making.

The data needs to be complete, current, accurate, unbiased, sufficient and relevant to the problem being addressed. Whilst incomplete data may pose hurdles to extracting the necessary insights, an irrelevant or outdated dataset will produce misleading ones. Analysing a company's HR data is unlikely to help its marketing department. In short, the data must accurately reflect the reality of the problem setting and should do this in as much detail as possible.
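As a minimal sketch of what such checks can look like in practice, the snippet below uses pandas (an assumed tooling choice) to flag missing values and stale records before any analysis is run, in the spirit of the completeness and currency dimensions above.

```python
# A minimal data-quality check with pandas on an invented data set.
import pandas as pd

df = pd.DataFrame({
    "matter_id": [1, 2, 3],
    "fee": [1500.0, None, 2200.0],  # a missing value (completeness)
    "last_updated": pd.to_datetime(["2024-01-10", "2019-03-02", "2024-02-01"]),
})

print(df.isna().sum())                         # missing values per column
stale = df[df["last_updated"] < "2023-01-01"]  # records older than a cut-off (currency)
print(stale)
```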

Efficiencies can be gained if the dataset is formatted in a consistent and valid (or expected) format from the outset. For example, dates formatted yyyy-mm-dd save a data scientist the need to resolve whether each data point is in dd/mm/yyyy or mm/dd/yyyy format. The example may seem trivial but is one of the most common and prevalent data formatting issues, especially if the dataset was collated from more than one region or source and therefore contains a mix of formats.
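The ambiguity is easy to demonstrate. In the sketch below, pandas (again an assumed choice) parses the ISO form directly, whereas the same string '01/02/2024' can legitimately be read two different ways depending on the dayfirst flag.

```python
# A minimal sketch of the date-format problem.
import pandas as pd

print(pd.to_datetime("2024-02-01"))                 # ISO yyyy-mm-dd: unambiguous
print(pd.to_datetime("01/02/2024", dayfirst=True))  # read as 1 February 2024
print(pd.to_datetime("01/02/2024", dayfirst=False)) # read as 2 January 2024
```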

Data visualisation is used to present accessible abstractions of the data in a visual (or graphical) manner. This can take many forms, from graphs and charts to interactive dashboards. The format used is primarily driven by the nature of the data (e.g. spatial, temporal, relative/absolute). The choice of visuals is important to ensure an accurate portrayal of the data set and the relationships between data variables, as well as their effective and engaging delivery.

These visualisation techniques are critical to the data science process, both as a tool for data exploration and for conveying findings and supporting the business team in understanding the outputs of the project. As a tool for data exploration, visualisation can be used by the project team to diagnose and cleanse the data set, helping data scientists check, understand and provide initial feedback on the data before calculations are carried out, as well as to test model performance and assumptions.
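A minimal example of visualisation as a diagnostic tool: the sketch below plots a histogram of invented fee values with matplotlib (an assumed tooling choice), the kind of quick check that surfaces outliers or data-entry errors before any modelling begins.

```python
# A minimal exploratory plot: a histogram used to spot anomalies.
import matplotlib.pyplot as plt

fees = [1200, 1350, 1280, 1420, 9800, 1310, 1390]  # 9800 looks anomalous

plt.hist(fees, bins=10)
plt.xlabel("Fee (£)")
plt.ylabel("Count")
plt.title("Distribution of matter fees")
plt.show()
```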

At its core, data visualisation helps make findings accessible and therefore actionable. When done well, it creates a narrative around the data, prioritising key findings as well as bringing relevant patterns, deviations and anomalies to the forefront. This supports efficient and reliable data-driven decision-making. Modern data visualisations are often presented in interactive dashboards, which allow exploration and active cross-examination of the findings as they are delivered.
