

Wednesday, 23 March 2022

How important are Data Structures in Software Development and Data Science?

Someone asked me: when do we apply data structures in software development and data science?

This basic question is important for many reasons, but above all it makes one appreciate the concept of data structures in a practical way. My answer was that I use data structures every time I code the backend of a web application or work on a data science project. Perhaps the simpler answer is that I use them every time I code.

Today, I want to talk about the importance of data structures in software development and data science based on my day-to-day work. I want to show, using sample code, instances where I am using data structures.

First off, currently I do backend development and data wrangling, analytics, and reporting mostly using Python. Previously, I developed applications using Java, and even Groovy. I only used C++ as part of my educational projects. I remember writing a text editor then for a statistical analysis package using C++, but that was more like a toy project. In the earlier stages of my career, my colleagues and I would also sometimes question the importance of data structures (and algorithms) in practical software development. In hindsight, I wonder why.

In this post, I will demonstrate that data structures are used virtually every time you are writing code for a project. I will illustrate using Python/Flask, but this specificity is inconsequential. The principle should apply across the board.

For you to comfortably grasp this submission, you should be familiar with data types and data structures (at least the basic ones) in Python or any other language.

Python has the following primitive data types: int, float, str, and bool. A date in Python is not a primitive type of its own; instead, we import the datetime module to work with dates as objects.
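To make this concrete, here is a minimal sketch of the primitive types alongside a date object from the datetime module (the variable names are illustrative):

```python
from datetime import date

# Primitive values have built-in types
age = 42             # int
pi = 3.14            # float
name = 'richard'     # str
enabled = True       # bool

# Dates come from the datetime module, not from a built-in primitive type
today = date(2022, 3, 23)
print(type(today).__name__)  # date
```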

Data Structures are built around the primitive data types, and they provide a means for organizing and efficiently accessing data in a program.

In Python, we have the following standard Data Structures, and they are all containers:

1. string (sequence, immutable), e.g. 'richard orama'
2. list (sequence, mutable, heterogeneous), e.g. [2, 6, 'richard', (1, 2)]
3. tuple (sequence, immutable, heterogeneous), e.g. (10, 20, 30)
4. dictionary (collection, mutable, key/value pairs), e.g. {'name': 'orama', 'programming_languages': ['python', 'java', 'javascript'], 'databases': ('mysql', 'postgresql')}
5. set (collection, mutable, unique members, unordered), e.g. {11, 55, 'kampala'}

Being containers, these data structures can hold one or a combination of primitive data types, or even other containers. For example, a list can contain an int, a string, and a tuple.
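The nesting described above can be sketched with the examples from the list:

```python
# A list mixing primitives and another container (a tuple)
mixed = [2, 6, 'richard', (1, 2)]

# A dictionary whose values are themselves containers
profile = {
    'name': 'orama',
    'programming_languages': ['python', 'java', 'javascript'],
    'databases': ('mysql', 'postgresql'),
}

# Indexing drills into the nested structures
print(mixed[3][0])                          # 1 (first element of the nested tuple)
print(profile['programming_languages'][0])  # python
```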

Other data structures, such as queues and stacks, can also be constructed in Python, but we shall not go that far here.
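For completeness, a minimal sketch: the standard library's collections.deque can serve as both a queue and a stack, depending on which end you pop from.

```python
from collections import deque

# As a queue (FIFO): append on the right, pop from the left
queue = deque()
queue.append('first')
queue.append('second')
print(queue.popleft())  # first

# As a stack (LIFO): append and pop from the same end
stack = deque()
stack.append('bottom')
stack.append('top')
print(stack.pop())  # top
```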

While developing a web application, I find that I use lists and dictionaries a lot (if not all the time). I use tuples less often, but still considerably. I use sets least often, but in situations that should not be ignored: when I do use a set, no other structure fits that situation.
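The classic situations where only a set will do are fast membership testing and set algebra. A small sketch (the district names are illustrative):

```python
# Membership testing and de-duplication are the classic set use cases
visited_districts = {'kampala', 'gulu', 'mbale'}

# O(1) average-case membership test
print('gulu' in visited_districts)  # True

# Set algebra: which districts in this batch have not been visited yet?
batch = ['kampala', 'arua', 'gulu', 'arua']
new_districts = set(batch) - visited_districts
print(new_districts)  # {'arua'}
```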

I will now show you some examples of when I use data structures so that you can appreciate their use cases. To drive the point home, I am picking random samples of Flask functions and endpoints just to demonstrate that data structures are applied everywhere. The snippets should be self-explanatory and informative enough even if you develop in another language.

@app.route('/select_districts', methods=['GET'])
@exception_handler
def select_districts():
    """
    Select all the records useful for populating drop-downs
    """

    district_list = District.query.filter(District.enabled == 1).with_entities(
        District.id, District.record_name).all()
    # zip(*rows) transposes the (id, name) rows into two parallel tuples
    list_zip = list(zip(*district_list))
    district_list_id = list_zip[0]
    district_list_name = list_zip[1]
    result = {'districtListId': district_list_id, 'districtListName': district_list_name}
    escape_exclusions = ['aHtml']
    return jsonify({k: v if (not isinstance(v, str) or k in escape_exclusions) else escape(v) for k, v in result.items()})

In the above snippet:

  • district_list and list_zip are lists
  • district_list_id and district_list_name are tuples (elements of list_zip)
  • result is a dictionary
  • the return value (JSON) is built from a dictionary
  • zip(*district_list) is a zip object, an iterator of tuples; calling list() on it materializes it into a list
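The zip(*rows) idiom above is worth isolating. A minimal sketch with hypothetical sample rows shaped like the (id, record_name) tuples the District query would return:

```python
# Hypothetical rows as returned by the District query: (id, record_name) tuples
district_list = [(1, 'Kampala'), (2, 'Gulu'), (3, 'Mbale')]

# zip(*rows) transposes rows into columns; list() materializes the iterator
list_zip = list(zip(*district_list))
district_list_id = list_zip[0]    # (1, 2, 3)
district_list_name = list_zip[1]  # ('Kampala', 'Gulu', 'Mbale')

result = {'districtListId': district_list_id, 'districtListName': district_list_name}
print(result['districtListId'])   # (1, 2, 3)
```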

def read_data(report_name, values_matrix, aggfunc_matrix):

    sql_fields = request.args.get('sqlFields')
    sql_from = request.args.get('sqlFrom')
    # map() returns a lazy iterator; [1:] leaves out 'None' (the first item)
    dimensions = map(lambda x: 'T.'+x.strip(),
                     session.get('report_dimensions').split(',')[1:])

    # converting the map to a list works only once; subsequent conversions
    # of the exhausted iterator yield an empty list
    dimensions_list = list(dimensions)

    sql_text = f"select {sql_fields} from {sql_from} where 1"

    if not validate_sql(sql_text, 'select'):
        return (None, None)

    if session['DBMS'] == 'MySQL':
        # mysql
        with pymysql.connect(host='localhost', port=3306, user='grails', password=os.environ.get('PASSWORD_MYSQL'), db='nomad1') as dbConnection:
            df = pd.read_sql(sql_text, dbConnection)
    elif session['DBMS'] == 'SQLite':
        sql_text = sql_text.replace("concat(c.textfield2, ' ', c.textfield1)", "c.textfield2 || c.textfield1")
        # Create a SQL connection to our SQLite database
        sqlite_file = os.path.join(basedir, app.config['SQLITE_FOLDER'], 'database.db')
        with sqlite3.connect(sqlite_file) as dbConnection:
            df = pd.read_sql(sql_text, dbConnection)

    return (df, [s.replace('T.', '') for s in dimensions_list])


In the above snippet:
  • request.args and session are dictionary-like mappings
  • dimensions is a map object (a lazy iterator); dimensions_list is a list
  • df is a pandas DataFrame, itself a tabular data structure
  • the return value is a tuple containing the DataFrame and a list
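The comment in the snippet about converting a map to a list "only once" deserves a quick demonstration. A map object is a lazy iterator: it yields each item a single time and is then exhausted. A sketch with a hypothetical report_dimensions string:

```python
# Hypothetical session value: a comma-separated dimensions string
report_dimensions = 'None, region , district'

# map() is lazy: nothing is computed until the iterator is consumed
dimensions = map(lambda x: 'T.' + x.strip(), report_dimensions.split(',')[1:])

first_pass = list(dimensions)   # consumes the iterator
second_pass = list(dimensions)  # the iterator is already exhausted

print(first_pass)   # ['T.region', 'T.district']
print(second_pass)  # []
```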

@logging
def report_matrix(**kwargs):
    ''' matrix report
    '''

    df = kwargs['df']
    report_group_rows = kwargs['report_group_rows']
    report_group_cols = kwargs['report_group_cols']
    values_matrix = kwargs['values_matrix'].replace('t.', '')  # remove the 't.' prefix
    aggfunc_matrix = kwargs['aggfunc_matrix']

    # could use a generator instead of a list comprehension (change [] to ()) - it's more efficient for large datasets
    index = [c for c in report_group_rows if c != 'None']
    # make unique; cannot use a set since it is unordered, but dicts preserve insertion order
    index = list({x: 'Dummy' for x in index if x != 'None'})

    columns = [c for c in report_group_cols if c != 'None']
    # make unique. cannot use set since it is unordered
    columns = list({x: 'Dummy' for x in columns if x != 'None'})

    df['Join_Quarter'] = df['Join_Date'].dt.quarter
    columns[-1] = 'Join_Quarter' #replace last with this

    # get pivot for each group, but remove unnecessary column totals (using iloc[:, :-1])
    report = df.groupby(index[0]).apply(lambda sub:
        sub.pivot_table(
            index=index,
            columns=columns,
            values=[i.strip() for i in values_matrix.split(',')][0],
            aggfunc=aggfunc_matrix, #[np.sum, len], # aggfunc_matrix,
            fill_value=0,
            dropna=True,
            margins=True,
            margins_name=f'SubTotal {sub.name}'
        ).iloc[:, :-1] #remove pivot column subtotals
    ).sort_index(axis=1)
    #row summaries
    number_levels = report.index.nlevels
    row_max = ['', 'Max']
    row_min = ['', 'Min']
    row_total = ['', 'Total']
    for _ in range(number_levels-2):
        row_max.append('')
        row_min.append('')
        row_total.append('')

    report1 = report[~report.index.get_level_values(1).str.contains('Total')] #exclude totals
    report.loc[tuple(row_max)] = report1.max()
    report.loc[tuple(row_min)] = report1.min()
    report.loc[tuple(row_total)] = report1.sum()
    #column summaries
    report['ColumnMax'] = report.max(axis=1)[:-3] # ignore last three elements of the series
    report['ColumnMin'] = report.drop(columns=['ColumnMax']).min(axis=1)[:-3] # ignore last three elements of the series
    report['ColumnTotal'] = report.drop(columns=['ColumnMax', 'ColumnMin']).sum(axis=1)

    #cleanup
    report.index = report.index.droplevel(0)

    report.rename(columns = {values_matrix: '',}, inplace = True)
    report = report.apply(pd.to_numeric) # convert all columns of DataFrame, but it re-introduces nan, so replace nan again below
    report = report.round(3)

    #report = report.iloc[:, :-4] #remove unwanted columns
    report.drop(['ColumnMax', 'ColumnMin'], axis=1, level=0, inplace=True) # drop columns from the multi-index columns

    return report

In the above snippet, many of the variables may represent any of the container data structures. It would be a useful exercise to go through and assign an appropriate data structure to each. I am sure you can see that the code snippets above are full of data structures; in fact, they are nothing but data structures. If you are not proficient in handling data structures, you may struggle to comprehend the logic.
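One idiom from the matrix report is worth calling out as an exercise answer: de-duplicating a list while preserving order. A set would lose the order, but since Python 3.7 dictionaries preserve insertion order, so a dict comprehension keyed on the items does the job. A sketch with hypothetical report groups:

```python
# Hypothetical report groups, with duplicates and a 'None' placeholder
report_group_rows = ['Region', 'District', 'Region', 'None', 'District']

index = [c for c in report_group_rows if c != 'None']
# Dict keys are unique and (since Python 3.7) keep insertion order,
# so this de-duplicates without losing the order; the values are throwaway
index = list({x: 'Dummy' for x in index})
print(index)  # ['Region', 'District']

# A set also de-duplicates, but its iteration order is not guaranteed
print(sorted(set(index)))  # ['District', 'Region']
```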
