Publications from Uppsala University
Towards Higher Code Quality in Scientific Computing
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Scientific Computing. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computational Science.
2021 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

In scientific computing and data science, computer programs employing mathematical and statistical models are used to obtain knowledge in different application domains. The results of these programs form the basis of, among other things, scientific papers and important decisions that may, for example, affect people's health. Consequently, the correctness of the programs is of great importance. To reduce the risk of defects in the source code, and to avoid wasting human resources, it is important that the code is maintainable, i.e. not unnecessarily hard to analyze, test, modify or reuse. For these reasons, this thesis strives towards increased maintainability and correctness in code bases for scientific computing and data science.

Object-oriented programming is a programming paradigm that facilitates writing maintainable code by providing mechanisms for reuse and for dividing code into smaller components with restricted access to each other's data. Further, it makes it possible to extend a code base without changing the existing code, which increases flexibility and decreases the risk of breaking existing functionality. However, in many cases, the benefits of object orientation come at the cost of performance. For some scientific computing programs, performance is essential, e.g. because the results are unusable if they are produced too late. In the first part of this thesis, it is shown that object-oriented programming can be used to improve the quality of an important group of scientific computing programs, with only a small impact on performance.

The aim of the second part of the thesis is to contribute to the understanding of, and to improve the quality of, source code for data science. A large corpus of Jupyter notebooks, a tool frequently used by data scientists for writing code, is studied. The results suggest that cloned code, i.e. identical or close to identical code that recurs in different places, is common in Jupyter notebooks. Code cloning is important from a maintenance perspective as well as for research. Additionally, the most frequently called library functions from Python, the language used in the vast majority of the notebooks, are studied. A large number of parameter combinations are identified for which it is possible to pass values that may lead to unexpected behavior or decreased maintainability. The existence and consequences of occurrences of such combinations of values in the corpus are evaluated. To reduce the risk of future defects in source code calling these functions, improvements are suggested to the developers of the functions.
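The kind of parameter interaction described above can be illustrated with a minimal sketch. The function `truncate_lenient` and its parameters `max_chars` and `max_words` are hypothetical and are not taken from the thesis or from any real library; they only show how one argument can be silently ignored when another is set, and how a stricter variant (`truncate_strict`, also hypothetical) could report the conflict instead.

```python
from typing import Optional


def truncate_lenient(text: str,
                     max_chars: Optional[int] = None,
                     max_words: Optional[int] = None) -> str:
    """Hypothetical function: if both limits are passed, max_words is silently ignored."""
    if max_chars is not None:
        # max_words is never looked at on this path, so the caller gets no warning.
        return text[:max_chars]
    if max_words is not None:
        return " ".join(text.split()[:max_words])
    return text


def truncate_strict(text: str,
                    max_chars: Optional[int] = None,
                    max_words: Optional[int] = None) -> str:
    """Hypothetical stricter variant: conflicting arguments raise an error."""
    if max_chars is not None and max_words is not None:
        raise ValueError("pass either max_chars or max_words, not both")
    return truncate_lenient(text, max_chars=max_chars, max_words=max_words)


if __name__ == "__main__":
    # The lenient call succeeds, although the caller asked for at most one word.
    print(truncate_lenient("one two three", max_chars=7, max_words=1))
    try:
        truncate_strict("one two three", max_chars=7, max_words=1)
    except ValueError as error:
        print("rejected:", error)
```

Raising an error is only one possible design; a warning or a documented precedence rule would be alternatives, and which of these fits a given library function is a judgment for its developers.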

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2021. p. 52
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 2000
National Category
Computational Mathematics Software Engineering
Research subject
Scientific Computing
Identifiers
URN: urn:nbn:se:uu:diva-430134
ISBN: 978-91-513-1107-4 (print)
OAI: oai:DiVA.org:uu-430134
DiVA, id: diva2:1514980
Public defence
2021-02-26, Zoom, 09:15 (English)
Opponent
Supervisors
Available from: 2021-02-03 Created: 2021-01-07 Last updated: 2021-03-04
List of papers
1. Impact of Code Refactoring using Object-Oriented Methodology on a Scientific Computing Application
2020 (English). Report (Other academic)
Publisher
p. 30
National Category
Software Engineering Computational Mathematics
Research subject
Scientific Computing
Identifiers
urn:nbn:se:uu:diva-429905 (URN)
Note

This is an extended version of the paper "Impact of Code Refactoring using Object-Oriented Methodology on a Scientific Computing Application" published at IEEE International Workshop on Source Code Analysis and Manipulation 2014.

Available from: 2021-01-05 Created: 2021-01-05 Last updated: 2021-01-07
2. Performance of an OO compute kernel on the JVM: Revisiting Java as a language for scientific computing applications (extended version)
2019 (English). Report (Other academic)
Series
Technical report / Department of Information Technology, Uppsala University, ISSN 1404-3203 ; 2019-007
National Category
Software Engineering Computational Mathematics
Identifiers
urn:nbn:se:uu:diva-396931 (URN)
Available from: 2019-11-04 Created: 2019-11-12 Last updated: 2024-05-29. Bibliographically approved
3. Jupyter Notebooks on GitHub: Characteristics and Code Clones
2021 (English). In: The Art, Science, and Engineering of Programming, E-ISSN 2473-7321, Vol. 5, no 3, article id 15. Article in journal (Refereed), Published
Abstract [en]

Jupyter notebooks have emerged as a standard tool for data science programming. Programs in Jupyter notebooks differ from typical programs in that they are constructed from a collection of code snippets interleaved with text and visualisation. This allows interactive exploration, and snippets may be executed in different orders, which may give rise to different results due to side effects between snippets. Previous studies have shown the presence of considerable code duplication (code clones) in the sources of traditional programs, in both so-called systems programming languages and so-called scripting languages. In this paper we present the first large-scale study of code cloning in Jupyter notebooks. We analyse a corpus of 2.7 million Jupyter notebooks hosted on GitHub, representing 37 million individual snippets and 227 million lines of code. We study clones at the level of individual snippets, and study the extent to which snippets recur across multiple notebooks. We study both identical clones and approximate clones, and conduct a small-scale ocular inspection of the most common clones. We find that code cloning is common in Jupyter notebooks: more than 70% of all code snippets are exact copies of other snippets (with possible differences in white space), and around 50% of all notebooks do not have any unique snippet, but consist solely of snippets that are also found elsewhere. In notebooks written in Python, at least 80% of all snippets are approximate clones, and the prevalence of code cloning is higher in Python than in other languages. We further find that clones between different repositories are far more common than clones within the same repository. However, the most common individual repository from which a Jupyter notebook contains clones is the repository in which the notebook itself resides.
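As a rough illustration of what counting exact clones "with possible differences in white space" could look like, the sketch below groups snippets by a hash of their whitespace-stripped text. This is not the method used in the paper: the function names, the in-memory `notebooks` mapping and the choice to drop all whitespace (which is cruder than a real tokenizer) are assumptions made for this example only.

```python
import hashlib
from collections import defaultdict


def normalize(snippet: str) -> str:
    """Drop all whitespace so snippets that differ only in spacing compare equal.

    Assumption for this sketch: this is a crude normalization, not the paper's.
    """
    return "".join(snippet.split())


def find_exact_clones(notebooks):
    """Map a digest to every (notebook, snippet index) location sharing that snippet."""
    groups = defaultdict(list)
    for name, snippets in notebooks.items():
        for index, snippet in enumerate(snippets):
            normalized = normalize(snippet)
            if not normalized:
                continue  # skip empty snippets
            digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
            groups[digest].append((name, index))
    # A snippet counts as a clone only if it occurs in more than one location.
    return {digest: locs for digest, locs in groups.items() if len(locs) > 1}


def clone_fraction(notebooks):
    """Fraction of non-empty snippets that have at least one exact copy elsewhere."""
    total = sum(1 for snippets in notebooks.values()
                for snippet in snippets if normalize(snippet))
    cloned = sum(len(locs) for locs in find_exact_clones(notebooks).values())
    return cloned / total if total else 0.0


def notebooks_without_unique_snippets(notebooks):
    """Names of notebooks whose non-empty snippets all occur elsewhere too."""
    cloned = {loc for locs in find_exact_clones(notebooks).values() for loc in locs}
    result = []
    for name, snippets in notebooks.items():
        indices = [i for i, s in enumerate(snippets) if normalize(s)]
        if indices and all((name, i) in cloned for i in indices):
            result.append(name)
    return sorted(result)


if __name__ == "__main__":
    # Toy corpus with made-up notebook names, for illustration only.
    corpus = {
        "a.ipynb": ["import numpy as np", "x = 1"],
        "b.ipynb": ["import numpy  as np\n", "y = 2"],
        "c.ipynb": ["x=1"],
    }
    print(clone_fraction(corpus))
    print(notebooks_without_unique_snippets(corpus))
```

Hashing keeps the grouping step linear in the number of snippets, which matters at the scale of a corpus with millions of snippets; detecting approximate clones would need a different technique and is not covered by this sketch.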

Keywords
Jupyter notebooks, Mining software repositories, Code cloning, Software analytics
National Category
Software Engineering Computational Mathematics
Identifiers
urn:nbn:se:uu:diva-429906 (URN)
10.22152/programming-journal.org/2021/5/15 (DOI)
Funder
eSSENCE - An eScience Collaboration
Available from: 2021-01-05 Created: 2021-01-05 Last updated: 2024-06-28. Bibliographically approved
4. To Err or Not to Err?: Subtle Interactions Between Parameters for Common Python Library Functions
(English). Article in journal (Other academic), Submitted
Keywords
Jupyter notebooks, Data Science, Python, Error suppression
National Category
Software Engineering Computational Mathematics
Research subject
Scientific Computing
Identifiers
urn:nbn:se:uu:diva-430133 (URN)
Projects
eSSENCE - An eScience Collaboration
Available from: 2021-01-07 Created: 2021-01-07 Last updated: 2024-09-17

Open Access in DiVA

fulltext (558 kB), 920 downloads
File information
File name: FULLTEXT01.pdf
File size: 558 kB
Checksum (SHA-512):
80641b70aecc7de3ca2c2e93db4e370f55501dfbd4d0295e1745fe899b077d97f731087c8e95a6fca374ace47960dc43bff887147acb738813800646681ed961
Type: fulltext
Mimetype: application/pdf

Authority records

Källén, Malin
