Finding non-English newspapers in Trove¶

There are a growing number of non-English newspapers digitised in Trove. However, if you're only searching using English keywords, you might never know that they're there. I thought it would be useful to generate a list of non-English newspapers, but it wasn't quite as straightforward as I thought.

How not to do it...¶

My first thought was I could start by searching for digitised newspapers amongst the library records in Trove. My theory was that catalogue metadata would include language information. For example, you can search for newspapers using format:Periodical/Newspaper in the books and libraries category (or the article API zone). To find those that are digitised, you can add a search for 'trove.nla.gov.au'. Here's the sort of results you get. Unfortunately, you only get about 826 results and there are many more newspapers than that in Trove. It seems links to digitised newspapers are not consistently recorded.

My second approach was to get the list of digitised newspapers from the API, extract the ISSN, then use this to search for catalogue records. Here's the code snippet I used.

params = {
    'zone': 'article',
    'encoding': 'json',
    'l-format': 'Periodical/Newspaper',
    'reclevel': 'full',
    'key': TROVE_API_KEY
}
newspapers = get_newspapers()
for newspaper in newspapers:
    print(f'\n{newspaper["title"]}')
    issn = newspaper.get('issn')
    params['q'] = f'issn:{issn}'
    response = s.get('https://api.trove.nla.gov.au/v2/result', params=params)
    data = response.json()
    try:
        works = data['response']['zone'][0]['records']['work']
    except KeyError:
        print('Not found')
    else:
        for work in works:
            print(work.get('language'))
    if not response.from_cache:
        time.sleep(0.2)

The main problem here is that not all titles have ISSNs. You could try searching on the titles if there's no ISSN, but this would involve a fair bit of disambiguation. In any case, in running this I discovered that while there is some language information in the metadata, it's not consistently applied. So basically a metadata-only approach is not going to work. Sigh...
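As a small safeguard for the snippet above, here is a minimal sketch that builds the query only when an ISSN is present, so titles without one are skipped rather than searched as issn:None. It assumes each newspaper record is a dict with an optional issn key; the helper name build_issn_query and the sample records are hypothetical, made up for illustration.

```python
def build_issn_query(newspaper):
    """Return an ISSN query string, or None if the title has no ISSN."""
    issn = newspaper.get("issn")
    return f"issn:{issn}" if issn else None


# Made-up sample records, for illustration only
sample = [
    {"title": "Example Times", "issn": "1234-5678"},
    {"title": "Example Gazette"},
]

queries = [build_issn_query(newspaper) for newspaper in sample]
# Titles without an ISSN produce None and can be skipped or handled separately
with_issn = [q for q in queries if q is not None]
```

Titles that come back as None are the ones that would need a title search and manual disambiguation.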

How I actually did it¶

If I couldn't get language details from metadata, then I had to try and extract it from the resource itself. I spent quite a bit of time looking around for Python packages that provided reliable language detection. The first one I tried regularly identified Mandarin as Korean (it turns out this was a known issue). Another one sent me into dependency hell. Finally I found pycld3 which installed with pip, and just worked.

My plan was to get the list of newspapers via the API as before, then fire off an empty search for each one. I'd then loop through the results, running the language detector over the article text. I set the query parameters to retrieve the maximum number of results in one request – 100. That seemed like a reasonable sample. To try and provide a big enough amount of text for the language detector to work with, I set the number of words parameter to return articles with between 100 and 1000 words. So the query parameters I used were:

params = {
    'category': 'newspaper',
    'encoding': 'json',
    'l-word': '100 - 1000 Words',
    'include': 'articletext',
    'n': 100,
}

Because some of the newspapers had short runs and the word count filter limits the results, I found that I wasn't always getting 100 results per newspaper. To work around this I found the likely language for each article, aggregated the counts, and then calculated the proportion of results for each language. This gave me the proportion of articles in each language – a number I could use across newspapers to find the non-English titles.
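The aggregation step described above can be sketched in a few lines. This is only an illustration: the function name language_proportions and the sample detections are hypothetical, and the full version used in this notebook appears in find_languages() further down.

```python
from collections import Counter


def language_proportions(langs):
    """Map each detected language code to its share of the detected articles."""
    counts = Counter(langs)
    total = sum(counts.values())
    if total == 0:
        # No usable articles, so there is nothing to report
        return {}
    return {lang: count / total for lang, count in counts.items()}


# Made-up detections for a short-run newspaper with only 8 usable articles
detected = ["de", "de", "de", "de", "de", "de", "en", "en"]
proportions = language_proportions(detected)
```

Because the values are proportions rather than raw counts, a title with 8 usable articles can be compared directly with one that returned the full 100.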

In general this worked pretty well, and the result was a list of 55 newspapers that have significant amounts of non-English content. However, I had to do a fair bit of fiddling to filter out dodgy results. All the details are included below.

Problems / limitations¶

  • It's no surprise that the results of the language detection are affected by the quality of the OCR.
  • In filtering out what seems to be the product of dodgy OCR, it's possible that I might be excluding some non-English content.
  • I'm only detecting the predominant language for each article, so there might be articles containing a mix of languages that are being missed.
  • I'm just taking the first 100 results from a blank search in each newspaper. Larger, or more randomised samples might produce different results.
  • Some dodgy detection results remain in the list of newspapers, but the point of this exercise was to find non-English newspapers. If you wanted to accurately determine the quantity of non-English content, you'd have to do a lot more fine-grained analysis.

Import what we need¶

In [51]:
import os
import re
from collections import Counter
from datetime import datetime, timedelta
from pathlib import Path

import altair as alt
import pandas as pd
import requests_cache
from dotenv import load_dotenv
from IPython.display import display
from language_tags import tags
from py3langid.langid import MODEL_FILE, LanguageIdentifier
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from tqdm.auto import tqdm

s = requests_cache.CachedSession(expire_after=timedelta(days=30))
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount("https://", HTTPAdapter(max_retries=retries))
s.mount("http://", HTTPAdapter(max_retries=retries))

load_dotenv()
Out[51]:
True
In [52]:
# Insert your Trove API key
API_KEY = "YOUR API KEY"

# Use api key value from environment variables if it is available
if os.getenv("TROVE_API_KEY"):
    API_KEY = os.getenv("TROVE_API_KEY")

headers = {"X-API-KEY": API_KEY}

Harvest the data and run language detection on articles¶

In [53]:
def get_newspapers():
    """
    Get a list of newspapers in Trove.
    """
    response = s.get(
        "https://api.trove.nla.gov.au/v3/newspaper/titles",
        params={"encoding": "json"},
        headers=headers,
    )
    data = response.json()
    return data["newspaper"]
In [54]:
def find_languages(sample_size=None):
    params = {
        "category": "newspaper",
        "encoding": "json",
        # 'l-category': 'Article',
        "l-word": "100 - 1000 Words",
        "include": "articletext",
        "n": 100,
    }
    newspaper_langs = []
    newspapers = get_newspapers()
    identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)
    for newspaper in tqdm(newspapers[:sample_size]):
        langs = []
        # print(f'\n{newspaper["title"]}')
        params["l-title"] = newspaper["id"]
        response = s.get(
            "https://api.trove.nla.gov.au/v3/result", params=params, headers=headers
        )
        data = response.json()
        n = data["category"][0]["records"]["n"]
        try:
            articles = data["category"][0]["records"]["article"]
        except KeyError:
            # print('Not found')
            pass
        else:
            # Detect language for each article in results
            for article in articles:
                if "articleText" in article:
                    # Clean up OCRd text by removing tags and extra whitespace
                    text = article["articleText"]
                    text = re.sub(r"<[^<]+?>", "", text)
                    text = re.sub(r"\s\s+", " ", text)
                    # Get the language
                    lang, prob = identifier.classify(text)
                    # Keep the prediction only if the normalised probability is high
                    if prob >= 0.95:
                        langs.append(lang)
            # Find the count of each language detected in the sample of articles
            for lang, count in Counter(langs).items():
                # Calculate the language count as a proportion of the articles
                # that received a confident prediction (not of all results)
                prop = count / len(langs)
                newspaper_langs.append(
                    {
                        "id": newspaper["id"],
                        "title": newspaper["title"],
                        "language": lang,
                        "proportion": prop,
                        "number": n,
                    }
                )
    return newspaper_langs
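To see what the clean-up step inside find_languages() does to the OCRd text, here is a small stand-alone sketch using the same two regular expressions. The function name clean_text and the sample string are made up for illustration.

```python
import re


def clean_text(text):
    """Strip markup tags, then collapse runs of two or more whitespace characters."""
    text = re.sub(r"<[^<]+?>", "", text)
    text = re.sub(r"\s\s+", " ", text)
    return text


# Made-up fragment of article text with tags and irregular spacing
sample = "<p>Der  Australische\n\n Spiegel</p> <p>Perth</p>"
cleaned = clean_text(sample)
```

Note that the second pattern only matches two or more whitespace characters in a row, so a single newline between words is left in place.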

Convert the results into a dataframe.

In [ ]:
newspaper_langs = find_languages()
df = pd.DataFrame(newspaper_langs)
df.head()

Add full language names¶

The language detector returns BCP-47-style language codes. To translate these into something that's a bit easier for humans to understand, we can use the language-tags package.

In [56]:
def get_full_language(lc):
    """
    Get full language names from codes
    """
    lang = tags.description(lc)
    if lang:
        return lang[0]
    else:
        print(lc)
        return lc


df["language_full"] = df["language"].apply(get_full_language)

Filtering the results¶

If we just look at the numbers of languages detected we might think that Australia's cultural diversity was much greater than we expected! But the likelihood that there were 40 newspapers publishing articles in Luxembourgish, or 20 publishing in Lao, seems small. Obviously there are a considerable number of false positives here.

In [57]:
df["language_full"].value_counts()
Out[57]:
language_full
English                    1786
Latin                       195
Luxembourgish                40
Aragonese                    31
Welsh                        22
Italian                      20
Lao                          20
Albanian                     16
German                       14
Swahili (macrolanguage)      14
Hebrew                       10
Chinese                       9
Northern Sami                 9
Tagalog                       8
Afrikaans                     8
Breton                        6
Portuguese                    5
Norwegian                     5
Quechua                       4
Armenian                      4
Faroese                       4
Modern Greek (1453-)          4
Japanese                      3
Bosnian                       3
French                        3
Polish                        2
Dutch                         2
Spanish                       2
Amharic                       2
Slovak                        2
Lithuanian                    2
Kinyarwanda                   2
Vietnamese                    2
Occitan (post 1500)           2
Croatian                      2
Javanese                      1
Indonesian                    1
Irish                         1
Haitian                       1
Belarusian                    1
Icelandic                     1
Ukrainian                     1
Romanian                      1
Estonian                      1
Danish                        1
Malagasy                      1
Walloon                       1
Esperanto                     1
Swedish                       1
Turkish                       1
Maltese                       1
Macedonian                    1
Kirghiz                       1
Slovenian                     1
Serbian                       1
Name: count, dtype: int64

Remember that for each language detected in a newspaper we calculated the proportion of articles in our results set in that language. So we can, for example, just look at newspapers where 100% of the articles are in a single language. This highlights a few non-English language newspapers, but obviously we're missing a lot of others.

In [58]:
df.loc[df["proportion"] == 1]["language_full"].value_counts()
Out[58]:
language_full
English                 1481
German                     3
Chinese                    3
Hebrew                     3
Modern Greek (1453-)       2
Italian                    2
Estonian                   1
Name: count, dtype: int64

If we chart the proportions, we see them bunched up at either end of the scale. So there are lots of languages detected in only a small proportion of articles.

In [59]:
alt.Chart(df).mark_bar().encode(x=alt.X("proportion:Q", bin=True), y="count():Q")
Out[59]:

If we zoom in on the proportions less than 0.1 (that's 10 articles in a sample of 100) we see that they're mostly less than 0.01 (or 1 article in 100). It seems likely that these are false positives.

In [60]:
alt.Chart(df.loc[df["proportion"] < 0.1]).mark_bar().encode(
    x=alt.X("proportion:Q", bin=True), y="count():Q"
)
Out[60]:

Let's be fairly conservative and filter out languages that have a proportion (per newspaper) less than 0.05. This list seems a bit more in line with what we would expect, but there are still some surprises – 33 newspapers published articles in Latin?

In [61]:
df.loc[df["proportion"] >= 0.05]["language_full"].value_counts()
Out[61]:
language_full
English                    1775
Latin                        33
Italian                      15
Chinese                       9
German                        9
Aragonese                     6
Lao                           5
Hebrew                        5
Luxembourgish                 4
Modern Greek (1453-)          4
Portuguese                    3
French                        3
Swahili (macrolanguage)       3
Welsh                         3
Lithuanian                    2
Dutch                         2
Norwegian                     2
Bosnian                       2
Polish                        2
Indonesian                    1
Tagalog                       1
Estonian                      1
Quechua                       1
Walloon                       1
Swedish                       1
Danish                        1
Ukrainian                     1
Albanian                      1
Esperanto                     1
Japanese                      1
Spanish                       1
Macedonian                    1
Name: count, dtype: int64

If we focus in on the newspapers that supposedly have a significant proportion of articles in Latin, we see some very strange results. I seriously doubt that around 20% of the Mildura Irrigationist from 1892-3 is in Latin. So what's going on?

In [62]:
df.loc[(df["proportion"] > 0.1) & (df["language_full"] == "Latin")]
Out[62]:
      id    title                                              language  proportion  number  language_full
229   1596  L'Italo-Australiano = The Italo-Australian (Su...  la        0.148936    100     Latin
273   350   Nepean Times (Penrith, NSW : 1882 - 1962)          la        0.110000    100     Latin
748   190   Windsor Express and Richmond Advertiser (NSW :...  la        0.130000    100     Latin
855   1207  The Coolangatta Chronicle (Qld. : 1926)            la        0.153846    26      Latin
1023  34    The Advertiser (Adelaide, SA : 1889 - 1931)        la        0.171717    100     Latin
1602  706   The Chinese Advertiser (Ballarat, Vic. : 1856)     la        0.200000    10      Latin
1619  685   The English and Chinese Advertiser (Vic. : 185...  la        0.227273    22      Latin
1672  1583  The Mildura Irrigationist (Vic. : 1892 - 1893)     la        0.208333    100     Latin
1678  1581  The Mildura Irrigationist and Murray River Agr...  la        0.189474    100     Latin
1691  1733  The Morwell Advocate and Boolara and Mirboo Ch...  la        0.238095    21      Latin
1887  1623  Geraldton Express and Murchison Goldfields New...  la        0.111111    100     Latin
2055  1402  The Central Districts Advocate (Goomalling, WA...  la        0.110000    100     Latin

If you look at results for the Mildura Irrigationist in Trove you'll see that many of the page images are blurry, and as a result the OCR is very, very bad. Here's a sample:

1KB JEWk'L CA8R. Mr. fWanw wiw latwjcht aft at llw .PaliiMi Ckact» tiMlty ini anaavi|vh af oMaioint wowf ^ bbrpmaaMMM. Mr plitdf I pillf, a«4 araa mlrwnl fa miMF atoailw |mml wrritadr. la thk «saa» Mr. Dakar— w«ltor)pMl ariifc . baTiqt oMiiwil • Mini of mawj fratn y Mi tot. Uptnk ami On. farJtanrfaiarkicth »Wrad«l«- Iroai Major and Kit. liar . gnai«i Mm. CMiwim «a ako coat aaillvd for I rial on tHurp >4 prjtiy, alkynl in hi* lawti raoimitimiIwr •u 'K<«. tW action for drfamatmn of «Imi«cirr vhkli lamiflit afaii^t Major ! ami Mi. Hritnp«r». in txme^mncr of Uwr MMiini thai aim M «ol««i mww valuaUr (ran tWir m«- ilfw. Ma}«r arid Mr*. Ilargreatw apfwakd lo tW Itrndt tor un'.i'(jr.

What happens when we feed this fragment of bad OCR to the language detector? Remarkably, the language detector is sure that it's Latin! To find out why this is the case, we'd probably have to dig into the way the language detection model was trained. But for our purposes it's enough to know that some of the languages detected seem to be the result of bad OCR.

In [63]:
ocr = """1KB JEWk'L CA8R.
Mr*. fWanw wiw latwjcht aft at llw
.PaliiMi Ckact» tiMlty ini anaavi|vh af
oMaioint wowf ^ bbrpmaaMMM. Mr
plitdf I pillf, a«4 araa mlrwnl fa
miMF atoailw |mml wrritadr. la thk
«saa» Mr*. Dakar— *w«ltor)pMl ariifc
.
baTiqt oMiiwil • Mini of mawj fratn
y Mi tot. Uptnk ami On. farJtanrfaiarkicth
»Wrad«l«- Iroai Major and Kit. liar
. gnai«i Mm. CMiwim* «a ako coat
aaillvd for I rial on tHurp >4 prjtiy,
alkynl in hi* lawti raoimitimiIwr
•u 'K<«. tW action for drfamatmn of «Imi«cirr
vhkli lamiflit afaii^t Major
! ami Mi*. H*ritnp«*r». in txme^mncr
of Uwr MMiini thai aim M «*ol««i
mww valuaUr (ran tWir m«-
ilfw. Ma}«r arid Mr*. Ilargreatw
apfwakd lo tW Itrndt tor un'.i'(jr.
"""
identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)
identifier.classify(ocr)
Out[63]:
('la', np.float32(1.0))

Of course there might actually be newspapers in unexpected languages, so we don't want to filter them all out. Instead let's do some manual inspection of the newspapers that seem to have non-English content. First we'll filter our results to include only languages with proportions of 0.05 or more, and then drop out newspapers that seem to be only in English. We end up with 100 different titles.

In [64]:
# The filter on the groupby drops out newspapers that only have articles in English.
filtered = (
    df.loc[df["proportion"] >= 0.05]
    .groupby(by=["title", "id"])
    .filter(lambda x: (len(x) > 1) or (len(x) == 1 and x["language"] != "en"))
)
papers = filtered.groupby(by=["title", "id"])
len(papers)
Out[64]:
100
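The logic of that filter can be easier to follow without pandas. The sketch below is a plain-Python approximation, not the code used to produce the results here: the function name non_english_candidates and the sample rows are hypothetical, and it assumes each language appears at most once per newspaper.

```python
from collections import defaultdict


def non_english_candidates(rows, threshold=0.05):
    """Keep languages at or above the threshold, grouped by newspaper id,
    then drop newspapers whose only remaining language is English."""
    by_paper = defaultdict(list)
    for row in rows:
        if row["proportion"] >= threshold:
            by_paper[row["id"]].append(row["language"])
    return {pid: langs for pid, langs in by_paper.items() if langs != ["en"]}


# Made-up rows in the same shape as the dataframe records
rows = [
    {"id": "a", "language": "en", "proportion": 0.98},
    {"id": "a", "language": "la", "proportion": 0.02},
    {"id": "b", "language": "de", "proportion": 1.0},
    {"id": "c", "language": "en", "proportion": 0.9},
    {"id": "c", "language": "it", "proportion": 0.1},
]
candidates = non_english_candidates(rows)
```

Newspaper "a" drops out because its only language above the threshold is English, while "b" and "c" are kept for manual inspection.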

Let's list those 100 newspapers. From the list below, I think it's pretty easy to pick out the results that are likely to be the product of bad OCR.

In [65]:
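for n, l in papers:
    # Only list titles with at least one non-English language at or above the threshold
    if not l.loc[(~l["language"].isin(["en"])) & (l["proportion"] >= 0.05)].empty:
        print(f"\n{n[0]} ({n[1]})")
        # Note: the display uses a strict > 0.05, so a language at exactly 0.05 is not shown
        display(
            l[["language_full", "language", "proportion"]]
            .loc[(l["proportion"] > 0.05)]
            .sort_values(by="proportion", ascending=False)
        )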
A Voz de Timor (Dili, East Timor : 1970 - 1975) (1498)
language_full language proportion
9 Portuguese pt 0.919192
Adelaider Deutsche Zeitung (SA : 1851 - 1862) (277)
language_full language proportion
915 German de 1.0
Australier Leben = Australian Life (Melbourne, Vic. : 1931 - 1933) (1686)
language_full language proportion
1256 Hebrew he 1.0
Australijos Lietuvis = The Australian Lithuanian (SA : 1948 - 1956) (1876)
language_full language proportion
920 Lithuanian lt 0.97
Australische Zeitung (Adelaide, SA : 1875 - 1916) (1150)
language_full language proportion
922 German de 1.0
Bangkok Recorder (Thailand : 1865 - 1867) (1488)
language_full language proportion
14 English en 0.939394
15 Portuguese pt 0.050505
Berita Repoeblik (Djakarta, Indonesia : 1945 - 1946) (1283)
language_full language proportion
17 Indonesian id 0.99
Chinese Republic News (Sydney, NSW : 1914 - 1937) (1186)
language_full language proportion
100 Chinese zh 0.97
Chinese Times (Melbourne, Vic. : 1902 - 1922) (705)
language_full language proportion
1293 Chinese zh 1.0
Chung Wah News (Perth, WA : 1981 - 1987) (1383)
language_full language proportion
1842 English en 0.50
1841 Chinese zh 0.49
Cobden Times (Vic. : 1918) (543)
language_full language proportion
1297 English en 0.91
1298 Latin la 0.08
Daily Post (Hobart, Tas. : 1908 - 1918) (860)
language_full language proportion
1113 English en 0.77
1114 Aragonese an 0.20
Der Australische Spiegel = The Australian Mirror (Perth, WA : 1952) (1385)
language_full language proportion
1867 German de 0.82
1868 English en 0.18
Deutsch-Australische Post : Wochenschrift = German-Australian Post : Weekly (Sydney, NSW : 1893 - 1906) (1600)
language_full language proportion
149 German de 1.0
Deutsche Zeitung fur Sud-Australien = German Times for South Australia (Tanunda, SA : 1851) (1577)
language_full language proportion
935 German de 0.9
934 English en 0.1
Die Brucke = The Bridge (Sydney, NSW : 1934 - 1939) (1591)
language_full language proportion
150 German de 0.71
151 English en 0.29
Die Deutsche Post für die Australischen Colonien = The German Australian Post (Adelaide, SA : 1848 - 1851) (1576)
language_full language proportion
936 German de 0.99
Dutch Australian Weekly (Sydney, NSW : 1951 - 1993) (1044)
language_full language proportion
156 Dutch nl 0.979592
Dutch Weekly (Sydney, NSW : 1993 - 2004) (1045)
language_full language proportion
159 Dutch nl 0.939394
160 English en 0.060606
Echo : Polski Tygodnik Niezalezny (Perth, WA : 1950 - 1952) (1384)
language_full language proportion
1872 Polish pl 0.88
1873 English en 0.12
Eco Italiano (Perth, WA : 1958 - 1959) (1387)
language_full language proportion
1874 Italian it 0.979592
Geraldton Express and Murchison Goldfields News (WA : 1894 - 1896) (1623)
language_full language proportion
1886 English en 0.585859
1888 Welsh cy 0.303030
1887 Latin la 0.111111
Geraldton Murchison Telegraph (WA : 1892 - 1899) (1625)
language_full language proportion
1893 English en 0.92
1894 Welsh cy 0.06
Guang yi hua bao = The Chinese Australian Herald (Sydney, NSW : 1894 - 1923) (704)
language_full language proportion
186 Chinese zh 1.0
Hellenic Echo (Perth, WA : 1967 - 1968) (1389)
language_full language proportion
1913 Modern Greek (1453-) el 1.0
Hobart Town Advertiser : Weekly Edt. (Tas. : 1859 - 1865) (1739)
language_full language proportion
1129 English en 0.95
Il Canguro = The Kangaroo (Perth, WA : 1955 - 1957) (1378)
language_full language proportion
1915 Italian it 0.96
Il Giornale Italiano (Sydney, NSW : 1932 - 1940) (279)
language_full language proportion
197 Italian it 0.91
198 English en 0.09
Il Risveglio = The Awakening (Sydney, NSW : 1944 - 1954) (1601)
language_full language proportion
199 Italian it 0.75
200 English en 0.25
Italian Bulletin of Australia (Sydney, NSW : 1922 - 1928, 1935 - 1940) (1602)
language_full language proportion
211 English en 0.80
212 Italian it 0.15
Italian Bulletin of Commerce (Sydney, NSW : 1929 - 1935) (1603)
language_full language proportion
214 English en 0.85
215 Italian it 0.14
Italo-Australian (Sydney, NSW : 1927 - 1940) (1595)
language_full language proportion
217 Italian it 0.97
Japanese Perth Times (Subiaco, WA : 1989 - 1996) (1386)
language_full language proportion
1918 Japanese ja 0.96
Kookynie Advocate and Northern Goldfields News (WA : 1903 - 1904) (1455)
language_full language proportion
1931 English en 0.92
1932 Latin la 0.07
Kyabram Union (Vic. : 1886 - 1894) (196)
language_full language proportion
1418 English en 0.93
1419 Latin la 0.07
L'Italo-Australiano = The Italo-Australian (Surry Hills, NSW : 1885) (1596)
language_full language proportion
227 Italian it 0.702128
229 Latin la 0.148936
230 Aragonese an 0.063830
228 Quechua qu 0.053191
L'Italo-Australiano = The Italo-Australian (Sydney, NSW : 1905 - 1909) (1597)
language_full language proportion
234 Italian it 0.97
La Rondine (Perth, WA : 1970 - 1974; 1983 - 1984) (1388)
language_full language proportion
1937 Italian it 0.98
Le Courrier Australien (Sydney, NSW : 1892 - 2011) (829)
language_full language proportion
238 French fr 0.76
239 English en 0.24
Mediterranean Voice (Perth, WA : 1971 - 1972) (1390)
language_full language proportion
1955 Modern Greek (1453-) el 0.333333
1954 English en 0.232323
1956 Portuguese pt 0.161616
1950 French fr 0.080808
1949 Spanish es 0.060606
Meie Kodu - Our Home (Sydney, NSW : 1949 - 1956) (280)
language_full language proportion
248 Estonian et 1.0
Menzies Weekly Times (WA : 1897 - 1898) (1636)
language_full language proportion
1961 English en 0.89
Moruya Examiner (NSW : 1881 - 1902) (1882)
language_full language proportion
255 English en 0.908163
257 Latin la 0.061224
Musu Pastoge = Our Haven (Sydney, NSW : 1950 - 1954) (1594)
language_full language proportion
263 Lithuanian lt 0.95
Nasza droga (Adelaide, SA : 1952 - 1954) (1323)
language_full language proportion
964 Polish pl 0.89
965 English en 0.11
Nepean Times (Penrith, NSW : 1882 - 1962) (350)
language_full language proportion
272 English en 0.88
273 Latin la 0.11
Norden (Melbourne, Vic. : 1914 - 1918) (797)
language_full language proportion
1460 Danish da 0.642857
1462 Norwegian no 0.153061
1464 Swedish sv 0.091837
1461 English en 0.061224
North Melbourne Gazette (Vic. : 1894 - 1901) (384)
language_full language proportion
1468 English en 0.94
Oceania (Sydney, NSW : 1913 - 1915) (1598)
language_full language proportion
282 Italian it 0.54
283 English en 0.46
Revue Australienne : Journal des Interets Francais en Australie ... (Sydney, NSW : 1873 - 1874) (1604)
language_full language proportion
302 French fr 0.98
Sandringham Southern Cross (Vic. : 1914 - 1918) (318)
language_full language proportion
1525 English en 0.939394
1527 Latin la 0.050505
Seamen's Strike Bulletin (Melbourne, Vic. : 1919) (1043)
language_full language proportion
1529 Chinese zh 0.2
1530 Lao lo 0.2
1531 Norwegian no 0.2
1532 Albanian sq 0.2
1533 Bosnian bs 0.2
South Sydney News (NSW : 1940) (1854)
language_full language proportion
315 English en 0.944444
316 Latin la 0.055556
Southern Morning Herald (Goulburn, NSW : 1920 - 1923) (418)
language_full language proportion
319 English en 0.885417
320 Latin la 0.083333
Stampa Italiana = The Italian Press (Perth, WA : 1931 - 1932) (1380)
language_full language proportion
2024 Italian it 0.98
Suedaustralische Zeitung (Adelaide, SA : 1850 - 1851) (314)
language_full language proportion
1018 German de 0.888889
1019 English en 0.111111
Sunday News (Sydney, NSW : 1919) (623)
language_full language proportion
323 English en 0.878788
324 Latin la 0.080808
Sunday Times Edizione Italiana (Perth, WA : 1958 - 1959) (1379)
language_full language proportion
2030 Italian it 1.0
Sydney General Trade List (NSW : 1834 - 1842) (694)
language_full language proportion
334 English en 0.908163
336 Latin la 0.051020
Sydney General Trade List, Mercantile Chronicle and Advertiser (NSW : 1830) (696)
language_full language proportion
340 English en 0.888889
341 Tagalog tl 0.111111
Sydney General Trade List, and Mercantile Advertiser (NSW : 1829 - 1830) (695)
language_full language proportion
338 English en 0.913043
339 Latin la 0.086957
Süd Australische Zeitung (Tanunda and Adelaide, SA : 1860 - 1874) (278)
language_full language proportion
1016 German de 0.99
The Advertiser (Adelaide, SA : 1889 - 1931) (34)
language_full language proportion
1021 English en 0.595960
1022 Luxembourgish lb 0.222222
1023 Latin la 0.171717
The Advertiser (Hobart, Tas. :  1837 - 1840) (1736)
language_full language proportion
1158 English en 0.93
1160 Walloon wa 0.06
The Australian Jewish News (Melbourne, Vic. : 1935 - 1999) (1685)
language_full language proportion
1572 English en 0.77
1573 Hebrew he 0.23
The Australian Jewish Post (St. Kilda, Vic. : 1966 - 1968) (1777)
language_full language proportion
1574 Hebrew he 1.0
The Bee of Australia (Sydney, NSW : 1844) (1011)
language_full language proportion
386 English en 0.923077
387 Italian it 0.061538
The Broughton Creek Register, and Kangaroo Valley and South Coast Farmer (Berry, NSW : 1886 - 1890) (1888)
language_full language proportion
427 English en 0.94
428 Aragonese an 0.06
The Brunswick and Coburg Leader (Vic. : 1914 - 1929) (293)
language_full language proportion
1595 English en 0.94
The Central Districts Advocate (Goomalling, WA : 1922 - 1924) (1402)
language_full language proportion
2054 English en 0.87
2055 Latin la 0.11
The Chinese Advertiser (Ballarat, Vic. : 1856) (706)
language_full language proportion
1601 Chinese zh 0.8
1602 Latin la 0.2
The Coolangatta Chronicle (Qld. : 1926) (1207)
language_full language proportion
854 English en 0.846154
855 Latin la 0.153846
The Derby News (WA : 1887) (1617)
language_full language proportion
2072 Luxembourgish lb 0.5
2073 English en 0.5
The English and Chinese Advertiser (Vic. : 1856 - 1858) (685)
language_full language proportion
1620 Chinese zh 0.772727
1619 Latin la 0.227273
The Gippsland Farmers' and Glengarry, Toongabbie and Cowwarr Journal (Traralgon, Vic. : 1922 - 1923) (1870)
language_full language proportion
1628 English en 0.94
1629 Aragonese an 0.06
The Goldfields Observer (Kalgoorlie, WA : 1930 - 1939) (1626)
language_full language proportion
2100 English en 0.878788
2101 Latin la 0.090909
The Herald of Tasmania (Hobart, Tas. : 1845) (1741)
language_full language proportion
1179 English en 0.9
The Hobart Town Daily Mercury (Tas. : 1858 - 1860) (33)
language_full language proportion
1189 English en 0.929293
1190 Latin la 0.060606
The Jewish Post (Melbourne, Vic. : 1949 - 1966) (1776)
language_full language proportion
1640 Hebrew he 1.0
The Jewish Weekly News (Melbourne, Vic. : 1933 - 1935) (1707)
language_full language proportion
1641 English en 0.81
1642 Hebrew he 0.19
The Mildura Irrigationist (Vic. : 1892 - 1893) (1583)
language_full language proportion
1670 English en 0.427083
1672 Latin la 0.208333
1673 Luxembourgish lb 0.187500
1671 Swahili (macrolanguage) sw 0.145833
The Mildura Irrigationist and Murray River Agricultural Times (Vic. : 1888) (1581)
language_full language proportion
1676 English en 0.473684
1678 Latin la 0.189474
1679 Luxembourgish lb 0.136842
1677 Swahili (macrolanguage) sw 0.094737
1680 Lao lo 0.094737
The Mildura Irrigationist and Murray River Cultural Advocate (Vic. : 1891 - 1892) (1582)
language_full language proportion
1682 English en 0.87
1683 Swahili (macrolanguage) sw 0.07
The Morwell Advocate and Boolara and Mirboo Chronicle (Vic. : 1886) (1733)
language_full language proportion
1692 English en 0.714286
1691 Latin la 0.238095
The Morwell Advocate and Narracan, Boolara and Mirboo Chronicle (Vic. : 1886) (1734)
language_full language proportion
1694 English en 0.917526
1696 Latin la 0.051546
The Mount Ararat Advertiser (Vic. : 1857) (1883)
language_full language proportion
1699 English en 0.916667
1700 Latin la 0.083333
The Reporter (Box Hill, Vic. : 1889 - 1925) (244)
language_full language proportion
1718 English en 0.938776
1717 Latin la 0.051020
The Richmond River Express and Casino Kyogle Advertiser (NSW : 1904 - 1929) (500)
language_full language proportion
604 English en 0.846939
606 Lao lo 0.051020
The Voice of Freedom = Elefthera Phoni (Perth, WA : 1956 - 1957) (1381)
language_full language proportion
2215 Modern Greek (1453-) el 0.98
The Yarrawonga Mercury and Mulwala (N.S.W.) News (Vic. : 1882 - 1892; 1894 - 1897) (1863)
language_full language proportion
1762 English en 0.89
1763 Latin la 0.07
To Ethnico Vema = Greek National Tribune (Arncliffe, NSW : 1931 - 1954) (1592)
language_full language proportion
706 Modern Greek (1453-) el 1.0
Tung Wah News (Sydney, NSW : 1898 - 1902) (1185)
language_full language proportion
713 Chinese zh 1.0
Tung Wah Times (Sydney, NSW : 1901 - 1936) (1184)
language_full language proportion
714 Chinese zh 0.99
Uniamoci (Sydney, NSW : 1903 - 1904) (1599)
language_full language proportion
725 Italian it 1.0
Upper Hunter Courier (Murrurundi, NSW : 1871) (810)
language_full language proportion
726 English en 0.928571
727 Lao lo 0.071429
Vesnik (Perth, WA : 1975 - 1994) (1382)
language_full language proportion
2249 Macedonian mk 0.412371
2248 English en 0.340206
2251 Bosnian bs 0.144330
Vil'na Dumka = Free Thought (Sydney, NSW : 1949 - 1954) (1593)
language_full language proportion
728 Ukrainian uk 0.82
729 English en 0.18
Warwick Daily News (Qld. : 1919 -1954) (892)
language_full language proportion
898 English en 0.887755
899 Latin la 0.081633
Williamstown Trade Circular (Vic. : 1855 - 1856) (213)
language_full language proportion
1805 English en 0.888889
1806 Esperanto eo 0.111111
Windsor Express and Richmond Advertiser (NSW : 1843 - 1844) (190)
language_full language proportion
747 English en 0.87
748 Latin la 0.13

I went through the titles above and compiled a list of title identifiers that seem to be producing dodgy results. We can use this to filter these newspapers out of our results.

In [66]:
# Titles where dodgy OCR causes false positives in language detection
# This was manually created after scanning results
dodgy = [
    "1036",
    "1043",
    "1011",
    "1103",
    "116",
    "1207",
    "1265",
    "13",
    "1320",
    "1336",
    "140",
    "1400",
    "1402",
    "145",
    "1455",
    "1488",
    "1543",
    "1546",
    "1581",
    "1582",
    "1583",
    "1617",
    "1623",
    "1625",
    "1626",
    "1636",
    "1638",
    "1675",
    "1678",
    "171",
    "1733",
    "1734",
    "1739",
    "1741",
    "1736",
    "1882",
    "1883",
    "1888",
    "1854",
    "1858",
    "1863",
    "1870",
    "1886",
    "190",
    "196",
    "213",
    "224",
    "244",
    "286",
    "292",
    "293",
    "318",
    "329",
    "33",
    "34",
    "350",
    "384",
    "389",
    "394",
    "418",
    "430",
    "431",
    "452",
    "479",
    "499",
    "500",
    "543",
    "570",
    "623",
    "694",
    "695",
    "696",
    "725",
    "763",
    "810",
    "860",
    "886",
    "892",
    "906",
    "92",
    "926",
    "927",
    "935",
    "937",
    "94",
    "946",
    "970",
    "986",
]
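
As a minimal sketch of how a list like this gets used, here's the `isin` pattern on a tiny, made-up data frame. The `toy` frame and `toy_dodgy` list are invented for illustration; only the `id` column name and the `~...isin(...)` pattern come from this notebook. The identifiers are stored as strings, so the `id` column needs to hold strings too if the comparison is going to match.

```python
import pandas as pd

# Toy data for illustration only -- not real results
toy = pd.DataFrame(
    {
        "id": ["13", "1592", "190"],
        "language": ["la", "el", "la"],
    }
)
toy_dodgy = ["13", "190"]  # a small stand-in for the full 'dodgy' list

# Keep rows whose title id is NOT in the list
kept = toy.loc[~toy["id"].isin(toy_dodgy)]
print(kept["id"].tolist())

# If the ids were integers, they would not match the string values
as_ints = toy["id"].astype(int)
print(as_ints.isin(toy_dodgy).any())
```

If your own ids turn out to be integers, converting them with `.astype(str)` before filtering should bring them back into line with the list.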

Let's list them again, excluding those in the 'dodgy' list.

In [67]:
for n, l in papers:
    # Build the mask from the group itself (l) rather than the full data frame
    if not l.loc[
        (~l["language"].isin(["en"]))
        & (l["proportion"] >= 0.05)
        & (~l["id"].isin(dodgy))
    ].empty:
        print(f"\n{n[0]} ({n[1]})")
        display(
            l[["language_full", "language", "proportion"]]
            .loc[(l["proportion"] > 0.05)]
            .sort_values(by="proportion", ascending=False)
        )
A Voz de Timor (Dili, East Timor : 1970 - 1975) (1498)
language_full language proportion
9 Portuguese pt 0.919192
Adelaider Deutsche Zeitung (SA : 1851 - 1862) (277)
language_full language proportion
915 German de 1.0
Australier Leben = Australian Life (Melbourne, Vic. : 1931 - 1933) (1686)
language_full language proportion
1256 Hebrew he 1.0
Australijos Lietuvis = The Australian Lithuanian (SA : 1948 - 1956) (1876)
language_full language proportion
920 Lithuanian lt 0.97
Australische Zeitung (Adelaide, SA : 1875 - 1916) (1150)
language_full language proportion
922 German de 1.0
Berita Repoeblik (Djakarta, Indonesia : 1945 - 1946) (1283)
language_full language proportion
17 Indonesian id 0.99
Chinese Republic News (Sydney, NSW : 1914 - 1937) (1186)
language_full language proportion
100 Chinese zh 0.97
Chinese Times (Melbourne, Vic. : 1902 - 1922) (705)
language_full language proportion
1293 Chinese zh 1.0
Chung Wah News (Perth, WA : 1981 - 1987) (1383)
language_full language proportion
1842 English en 0.50
1841 Chinese zh 0.49
Der Australische Spiegel = The Australian Mirror (Perth, WA : 1952) (1385)
language_full language proportion
1867 German de 0.82
1868 English en 0.18
Deutsch-Australische Post : Wochenschrift = German-Australian Post : Weekly (Sydney, NSW : 1893 - 1906) (1600)
language_full language proportion
149 German de 1.0
Deutsche Zeitung fur Sud-Australien = German Times for South Australia (Tanunda, SA : 1851) (1577)
language_full language proportion
935 German de 0.9
934 English en 0.1
Die Brucke = The Bridge (Sydney, NSW : 1934 - 1939) (1591)
language_full language proportion
150 German de 0.71
151 English en 0.29
Die Deutsche Post für die Australischen Colonien = The German Australian Post (Adelaide, SA : 1848 - 1851) (1576)
language_full language proportion
936 German de 0.99
Dutch Australian Weekly (Sydney, NSW : 1951 - 1993) (1044)
language_full language proportion
156 Dutch nl 0.979592
Dutch Weekly (Sydney, NSW : 1993 - 2004) (1045)
language_full language proportion
159 Dutch nl 0.939394
160 English en 0.060606
Echo : Polski Tygodnik Niezalezny (Perth, WA : 1950 - 1952) (1384)
language_full language proportion
1872 Polish pl 0.88
1873 English en 0.12
Eco Italiano (Perth, WA : 1958 - 1959) (1387)
language_full language proportion
1874 Italian it 0.979592
Guang yi hua bao = The Chinese Australian Herald (Sydney, NSW : 1894 - 1923) (704)
language_full language proportion
186 Chinese zh 1.0
Hellenic Echo (Perth, WA : 1967 - 1968) (1389)
language_full language proportion
1913 Modern Greek (1453-) el 1.0
Il Canguro = The Kangaroo (Perth, WA : 1955 - 1957) (1378)
language_full language proportion
1915 Italian it 0.96
Il Giornale Italiano (Sydney, NSW : 1932 - 1940) (279)
language_full language proportion
197 Italian it 0.91
198 English en 0.09
Il Risveglio = The Awakening (Sydney, NSW : 1944 - 1954) (1601)
language_full language proportion
199 Italian it 0.75
200 English en 0.25
Italian Bulletin of Australia (Sydney, NSW : 1922 - 1928, 1935 - 1940) (1602)
language_full language proportion
211 English en 0.80
212 Italian it 0.15
Italian Bulletin of Commerce (Sydney, NSW : 1929 - 1935) (1603)
language_full language proportion
214 English en 0.85
215 Italian it 0.14
Italo-Australian (Sydney, NSW : 1927 - 1940) (1595)
language_full language proportion
217 Italian it 0.97
Japanese Perth Times (Subiaco, WA : 1989 - 1996) (1386)
language_full language proportion
1918 Japanese ja 0.96
L'Italo-Australiano = The Italo-Australian (Surry Hills, NSW : 1885) (1596)
language_full language proportion
227 Italian it 0.702128
229 Latin la 0.148936
230 Aragonese an 0.063830
228 Quechua qu 0.053191
L'Italo-Australiano = The Italo-Australian (Sydney, NSW : 1905 - 1909) (1597)
language_full language proportion
234 Italian it 0.97
La Rondine (Perth, WA : 1970 - 1974; 1983 - 1984) (1388)
language_full language proportion
1937 Italian it 0.98
Le Courrier Australien (Sydney, NSW : 1892 - 2011) (829)
language_full language proportion
238 French fr 0.76
239 English en 0.24
Mediterranean Voice (Perth, WA : 1971 - 1972) (1390)
language_full language proportion
1955 Modern Greek (1453-) el 0.333333
1954 English en 0.232323
1956 Portuguese pt 0.161616
1950 French fr 0.080808
1949 Spanish es 0.060606
Meie Kodu - Our Home (Sydney, NSW : 1949 - 1956) (280)
language_full language proportion
248 Estonian et 1.0
Musu Pastoge = Our Haven (Sydney, NSW : 1950 - 1954) (1594)
language_full language proportion
263 Lithuanian lt 0.95
Nasza droga (Adelaide, SA : 1952 - 1954) (1323)
language_full language proportion
964 Polish pl 0.89
965 English en 0.11
Norden (Melbourne, Vic. : 1914 - 1918) (797)
language_full language proportion
1460 Danish da 0.642857
1462 Norwegian no 0.153061
1464 Swedish sv 0.091837
1461 English en 0.061224
Oceania (Sydney, NSW : 1913 - 1915) (1598)
language_full language proportion
282 Italian it 0.54
283 English en 0.46
Revue Australienne : Journal des Interets Francais en Australie ... (Sydney, NSW : 1873 - 1874) (1604)
language_full language proportion
302 French fr 0.98
Stampa Italiana = The Italian Press (Perth, WA : 1931 - 1932) (1380)
language_full language proportion
2024 Italian it 0.98
Suedaustralische Zeitung (Adelaide, SA : 1850 - 1851) (314)
language_full language proportion
1018 German de 0.888889
1019 English en 0.111111
Sunday Times Edizione Italiana (Perth, WA : 1958 - 1959) (1379)
language_full language proportion
2030 Italian it 1.0
Süd Australische Zeitung (Tanunda and Adelaide, SA : 1860 - 1874) (278)
language_full language proportion
1016 German de 0.99
The Australian Jewish News (Melbourne, Vic. : 1935 - 1999) (1685)
language_full language proportion
1572 English en 0.77
1573 Hebrew he 0.23
The Australian Jewish Post (St. Kilda, Vic. : 1966 - 1968) (1777)
language_full language proportion
1574 Hebrew he 1.0
The Chinese Advertiser (Ballarat, Vic. : 1856) (706)
language_full language proportion
1601 Chinese zh 0.8
1602 Latin la 0.2
The English and Chinese Advertiser (Vic. : 1856 - 1858) (685)
language_full language proportion
1620 Chinese zh 0.772727
1619 Latin la 0.227273
The Jewish Post (Melbourne, Vic. : 1949 - 1966) (1776)
language_full language proportion
1640 Hebrew he 1.0
The Jewish Weekly News (Melbourne, Vic. : 1933 - 1935) (1707)
language_full language proportion
1641 English en 0.81
1642 Hebrew he 0.19
The Voice of Freedom = Elefthera Phoni (Perth, WA : 1956 - 1957) (1381)
language_full language proportion
2215 Modern Greek (1453-) el 0.98
To Ethnico Vema = Greek National Tribune (Arncliffe, NSW : 1931 - 1954) (1592)
language_full language proportion
706 Modern Greek (1453-) el 1.0
Tung Wah News (Sydney, NSW : 1898 - 1902) (1185)
language_full language proportion
713 Chinese zh 1.0
Tung Wah Times (Sydney, NSW : 1901 - 1936) (1184)
language_full language proportion
714 Chinese zh 0.99
Uniamoci (Sydney, NSW : 1903 - 1904) (1599)
language_full language proportion
725 Italian it 1.0
Vesnik (Perth, WA : 1975 - 1994) (1382)
language_full language proportion
2249 Macedonian mk 0.412371
2248 English en 0.340206
2251 Bosnian bs 0.144330
Vil'na Dumka = Free Thought (Sydney, NSW : 1949 - 1954) (1593)
language_full language proportion
728 Ukrainian uk 0.82
729 English en 0.18

Here we'll add the dodgy title ids into our filter. It seems that we have 55 newspapers with significant amounts of non-English content.

In [68]:
# The filter removes titles whose only language above the threshold is English
filtered = (
    df.loc[(~df["id"].isin(dodgy)) & (df["proportion"] >= 0.05)]
    .groupby(by=["title", "id"])
    # Compare the single remaining language as a scalar, not as a Series
    .filter(lambda x: len(x) > 1 or x["language"].iloc[0] != "en")
)
papers = filtered.groupby(by=["title", "id"])
len(papers)
Out[68]:
55
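
To see what the group filter is doing, here's a small sketch on invented data. The toy titles and the helper name `has_non_english` are assumptions for illustration; the function just spells out the same rule as the lambda above.

```python
import pandas as pd

# Toy data for illustration only -- titles and numbers are invented
toy = pd.DataFrame(
    {
        "title": ["Paper A", "Paper B", "Paper B", "Paper C"],
        "id": ["1", "2", "2", "3"],
        "language": ["en", "de", "en", "it"],
        "proportion": [1.0, 0.8, 0.2, 1.0],
    }
)


def has_non_english(group):
    # Keep a title if it has more than one language above the threshold,
    # or if its single language isn't English
    return len(group) > 1 or group["language"].iloc[0] != "en"


kept = toy.groupby(by=["title", "id"]).filter(has_non_english)
print(sorted(kept["title"].unique()))
```

Here 'Paper A' drops out because English is its only language, while the other two titles keep all of their rows.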

Let's list them.

In [69]:
for n, l in papers:
    print(n[0])
A Voz de Timor (Dili, East Timor : 1970 - 1975)
Adelaider Deutsche Zeitung (SA : 1851 - 1862)
Australier Leben = Australian Life (Melbourne, Vic. : 1931 - 1933)
Australijos Lietuvis = The Australian Lithuanian (SA : 1948 - 1956)
Australische Zeitung (Adelaide, SA : 1875 - 1916)
Berita Repoeblik (Djakarta, Indonesia : 1945 - 1946)
Chinese Republic News (Sydney, NSW : 1914 - 1937)
Chinese Times (Melbourne, Vic. : 1902 - 1922)
Chung Wah News (Perth, WA : 1981 - 1987)
Der Australische Spiegel = The Australian Mirror (Perth, WA : 1952)
Deutsch-Australische Post : Wochenschrift = German-Australian Post : Weekly (Sydney, NSW : 1893 - 1906)
Deutsche Zeitung fur Sud-Australien = German Times for South Australia (Tanunda, SA : 1851)
Die Brucke = The Bridge (Sydney, NSW : 1934 - 1939)
Die Deutsche Post für die Australischen Colonien = The German Australian Post (Adelaide, SA : 1848 - 1851)
Dutch Australian Weekly (Sydney, NSW : 1951 - 1993)
Dutch Weekly (Sydney, NSW : 1993 - 2004)
Echo : Polski Tygodnik Niezalezny (Perth, WA : 1950 - 1952)
Eco Italiano (Perth, WA : 1958 - 1959)
Guang yi hua bao = The Chinese Australian Herald (Sydney, NSW : 1894 - 1923)
Hellenic Echo (Perth, WA : 1967 - 1968)
Il Canguro = The Kangaroo (Perth, WA : 1955 - 1957)
Il Giornale Italiano (Sydney, NSW : 1932 - 1940)
Il Risveglio = The Awakening (Sydney, NSW : 1944 - 1954)
Italian Bulletin of Australia (Sydney, NSW : 1922 - 1928, 1935 - 1940)
Italian Bulletin of Commerce (Sydney, NSW : 1929 - 1935)
Italo-Australian (Sydney, NSW : 1927 - 1940)
Japanese Perth Times (Subiaco, WA : 1989 - 1996)
L'Italo-Australiano = The Italo-Australian (Surry Hills, NSW : 1885)
L'Italo-Australiano = The Italo-Australian (Sydney, NSW : 1905 - 1909)
La Rondine (Perth, WA : 1970 - 1974; 1983 - 1984)
Le Courrier Australien (Sydney, NSW : 1892 - 2011)
Mediterranean Voice (Perth, WA : 1971 - 1972)
Meie Kodu - Our Home (Sydney, NSW : 1949 - 1956)
Musu Pastoge = Our Haven (Sydney, NSW : 1950 - 1954)
Nasza droga (Adelaide, SA : 1952 - 1954)
Norden (Melbourne, Vic. : 1914 - 1918)
Oceania (Sydney, NSW : 1913 - 1915)
Revue Australienne : Journal des Interets Francais en Australie ... (Sydney, NSW : 1873 - 1874)
Stampa Italiana = The Italian Press (Perth, WA : 1931 - 1932)
Suedaustralische Zeitung (Adelaide, SA : 1850 - 1851)
Sunday Times Edizione Italiana (Perth, WA : 1958 - 1959)
Süd Australische Zeitung (Tanunda and Adelaide, SA : 1860 - 1874)
The Australian Jewish News (Melbourne, Vic. : 1935 - 1999)
The Australian Jewish Post (St. Kilda, Vic. : 1966 - 1968)
The Chinese Advertiser (Ballarat, Vic. : 1856)
The English and Chinese Advertiser (Vic. : 1856 - 1858)
The Jewish Post (Melbourne, Vic. : 1949 - 1966)
The Jewish Weekly News (Melbourne, Vic. : 1933 - 1935)
The Voice of Freedom = Elefthera Phoni (Perth, WA : 1956 - 1957)
To Ethnico Vema = Greek National Tribune (Arncliffe, NSW : 1931 - 1954)
Tung Wah News (Sydney, NSW : 1898 - 1902)
Tung Wah Times (Sydney, NSW : 1901 - 1936)
Uniamoci (Sydney, NSW : 1903 - 1904)
Vesnik (Perth, WA : 1975 - 1994)
Vil'na Dumka = Free Thought (Sydney, NSW : 1949 - 1954)

That's looking pretty good. Let's save the results as a Markdown file to make it easy to explore. We'll include links into Trove. Here's the list of all 55 newspapers (also as a Gist).

In [70]:
# Titles include non-ASCII characters, so set the encoding explicitly
with open(Path("non-english-newspapers.md"), "w", encoding="utf-8") as md_file:
    for i, (n, l) in enumerate(papers, start=1):
        md_file.write(
            f"\n### {i}. [{n[0]}](http://nla.gov.au/nla.news-title{n[1]})\n\n"
        )
        md_file.write("| Language | Language code | Proportion of sample |\n")
        md_file.write("|---|---|---|\n")
        for row in (
            l[["language_full", "language", "proportion"]]
            .loc[(l["proportion"] > 0.05)]
            .sort_values(by="proportion", ascending=False)
            .itertuples()
        ):
            md_file.write(
                f"| {row.language_full} | {row.language} | {row.proportion} |\n"
            )

Save the results as a CSV file.

In [71]:
filtered.to_csv(
    f"newspapers_non_english_{datetime.now().strftime('%Y%m%d')}.csv", index=False
)
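
If you reload a CSV like this later on, it's worth knowing that pandas will normally read a column of numeric-looking ids back in as integers. Here's a rough sketch, using an in-memory buffer and made-up rows rather than the real file, of asking for strings instead so the ids still line up with a list of string ids like `dodgy`.

```python
import io

import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"id": ["13", "1592"], "language": ["la", "el"]})

# Write to an in-memory buffer rather than a dated file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Without a dtype hint the ids come back as integers...
buffer.seek(0)
as_default = pd.read_csv(buffer)

# ...so ask for strings to stay compatible with lists of string ids
buffer.seek(0)
as_strings = pd.read_csv(buffer, dtype={"id": str})
print(as_strings["id"].tolist())
```
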
In [ ]:
# IGNORE THIS CELL -- FOR TESTING ONLY
if os.getenv("GW_STATUS") == "dev":
    newspaper_langs = find_languages(sample_size=5)
    df = pd.DataFrame(newspaper_langs)
    assert df.shape[0] >= 5
    assert list(df.columns) == ["id", "title", "language", "proportion", "number"]

Created by Tim Sherratt for the GLAM Workbench.