english corpus free download

Showing 47 open source projects for "english corpus"

1

DeepSeek Coder

DeepSeek Coder: Let the Code Write Itself

DeepSeek-Coder is a series of code-specialized language models designed to generate, complete, and infill code (and mixed code + natural language) with high fluency in both English and Chinese. The models are trained from scratch on a massive corpus (~2 trillion tokens), of which about 87% is code and 13% is natural language. This dataset covers project-level code structure (not just line-by-line snippets), using a large context window (e.g. 16K) and a secondary fill-in-the-blank objective to encourage better contextual completions and infilling. ...

Downloads: 9 This Week

Last Update: 2025-11-11
2

iramuteq

IRAMUTEQ (Interface de R pour les Analyses Multidimensionnelles de Textes et de Questionnaires): an R interface for multidimensional analyses of texts and questionnaires. Data-processing software for text corpora or for individuals/characters-type data. In particular, it can perform "ALCESTE"-type analyses.

Downloads: 998 This Week

Last Update: 2024-11-03
3

Dodge gpt

Bypass AI-content detectors such as GPTZero by making text undetectable

*New Update* DODGE V10 - STEALTH EDITION: The Only AI Text Humanizer That Defeats GPTZero. 🛡️ CURRENT STATUS: GPTZERO RESISTANT - VERIFIED 2026 📊...

2 Reviews

Downloads: 2 This Week

Last Update: 2026-02-17
4

TEXminer

Text Mining Classification for Texts in ASCII, Unicode and PDF Format.

...TEXminer allows Language Detection by Letter Frequency Analysis, finding important Words by Cooccurrence Analysis, Determination of Central Expressions, Thematic Text Classification (also Semantic Groups), Fingerprint Comparison and Word Frequency. Because TEXminer is not designed to have a Reference Corpus, Thematic Model Statistics uses Language Models (lexicons) to have Background Knowledge about certain Languages (English, German, French, Spanish, Italian, Russian), which are derived from the Decaleon Project. The Thematic Models for Standard Vocabulary have been extended (spring 2015). The Thematic Models for Technical Terms have been extended (2015). ...

Downloads: 0 This Week

Last Update: 2025-03-25
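
The TEXminer description mentions language detection by letter frequency analysis. TEXminer's own procedure and lexicons are not shown in this listing, so the following is only a minimal sketch of the general idea: build a letter-frequency profile per language and pick the language whose profile is closest by cosine similarity. The function names and the two tiny sample sentences are assumptions made up for illustration; real profiles would come from much larger texts.

```python
from collections import Counter
import math


def letter_profile(text):
    """Return the relative frequency of each alphabetic character in text."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters)
    if total == 0:
        return {}
    return {ch: n / total for ch, n in Counter(letters).items()}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(value * q.get(ch, 0.0) for ch, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


# Hypothetical reference samples; far too small for serious use.
SAMPLES = {
    "English": "the quick brown fox jumps over the lazy dog and then "
               "the dog sleeps in the warm sun all day",
    "German": "der schnelle braune fuchs springt über den faulen hund und "
              "dann schläft der hund den ganzen tag in der sonne",
}
PROFILES = {lang: letter_profile(sample) for lang, sample in SAMPLES.items()}


def detect_language(text):
    """Return the language whose letter profile is most similar to text."""
    profile = letter_profile(text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))
```

Calling `detect_language(some_text)` returns one of the keys of `PROFILES`. Short inputs give unreliable results with this approach, which is one reason a real tool would rely on larger language models.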
5

LF Aligner

LF Aligner helps translators create translation memories from texts and their translations. It relies on Hunalign for automatic sentence pairing. Input: txt, doc, docx, rtf, pdf, html. Output: tab delimited txt, TMX and xls. With web features. My email address is listed in readme.txt; for support, use the forum here. My personal website: www.farkastranslations.com.

13 Reviews

Downloads: 183 This Week

Last Update: 2023-09-04
6

Linguistic Analyzer

The Linguistic Analyzer is a tool for corpus analysis and comparison

The Linguistic Analyzer (Almuhalil Alloghawy) is a free tool designed by a team from Al-Imam Muhammad bin Saud Islamic University that can be used for corpus analysis and comparison in terms of several linguistic characteristics, such as frequency list generation, concordances, collocation extraction, the difference between two words, and keyword identification.

Downloads: 1 This Week

Last Update: 2022-04-16
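
Two of the features listed for the Linguistic Analyzer, frequency lists and concordances, can be illustrated in a few lines. This is a generic sketch and not the tool's own code: the tokenizer, the function names and the example sentence are assumptions.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def frequency_list(text, top=None):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokenize(text)).most_common(top)


def concordance(text, keyword, width=2):
    """Return (left context, keyword, right context) for every hit."""
    tokens = tokenize(text)
    target = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


text = "The cat sat on the mat. The dog sat on the log."
top_word = frequency_list(text, top=1)   # [('the', 4)]
sat_lines = concordance(text, "sat")     # two keyword-in-context hits
```

Comparing two corpora, as the description mentions, would then amount to building one frequency list per corpus and contrasting the counts.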
7

DWDS/Dialing Concordance

a collection of indexing and search tools for corpus linguists

DWDS/Dialing Concordance (DDC) - a collection of index and search tools for corpus linguists

2 Reviews

Downloads: 29 This Week

Last Update: 2021-06-16
8

Arabic Corpus

Text categorization, Arabic language processing, language modeling

The Arabic Corpus, compiled by Dr. Mourad Abbas (http://sites.google.com/site/mouradabbas9/corpora). The corpus Khaleej-2004 contains 5690 documents divided into 4 topics (categories). The corpus Watan-2004 contains 20291 documents organized in 6 topics (categories). Researchers who use these two corpora should cite the two main references: (1) For the Watan-2004 corpus: M. Abbas, K. Smaili, D. Berkani, (2011) Evaluation of Topic Identification Methods on...

Downloads: 3 This Week

Last Update: 2019-03-05
9

concordia

Powerful search library, best suited for computer-aided translation

Concordia - Roman goddess of agreement. Concordance searcher - tool for translators who need their translations to "agree" with one standard. Concordia is a C++ library for fast text lookup in large corpora. It uses a RAM stored index, which takes up approximately 600MB of memory for a corpus of 2 million sentences. It is based on the idea of a suffix array, enhanced by the presence of other auxiliary data structures. The effects are stunning - Concordia is able to do simple substring...

Downloads: 0 This Week

Last Update: 2019-02-28
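
The Concordia description says its index is based on the idea of a suffix array. Concordia itself is a C++ library with additional auxiliary structures that are not described here, so the following Python code is only a naive sketch of the underlying idea: sort all suffix start positions, then binary-search for the range of suffixes that begin with the query. The function names are assumptions, and the construction shown is far slower than what a production library would use.

```python
def build_suffix_array(text):
    """Return suffix start positions sorted by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def _bound(text, suffix_array, pattern, upper):
    """Binary search over suffixes truncated to len(pattern) characters.

    With upper=False, return the first index whose prefix is >= pattern;
    with upper=True, return the first index whose prefix is > pattern.
    """
    lo, hi = 0, len(suffix_array)
    m = len(pattern)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        prefix = text[start:start + m]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def find_all(text, suffix_array, pattern):
    """Return all start positions of pattern in text, in ascending order."""
    first = _bound(text, suffix_array, pattern, upper=False)
    last = _bound(text, suffix_array, pattern, upper=True)
    return sorted(suffix_array[first:last])


corpus = "the cat sat on the mat"
index = build_suffix_array(corpus)
positions = find_all(corpus, index, "at")   # [5, 9, 20]
```

Once the index exists, each lookup needs only a logarithmic number of string comparisons, which is why this kind of structure suits repeated substring searches over a large, fixed corpus.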
10

QJDicExample

QJDicExample is an English <-> Japanese dictionary.

QJDicExample is a Japanese-to-English and English-to-Japanese dictionary featuring word/name/kanji/sentence search. QJDicExample uses the JMdict, JMnedict, Kanjidic2, Radkfilex, KanjiVG, and Tanaka Corpus / Tatoeba databases for translations and the zinnia recognition library for handwritten kanji recognition. Latest source code: git clone git://git.code.sf.net/p/qjdicexample/code qjdicexample-code

1 Review

Downloads: 4 This Week

Last Update: 2019-01-19
11

eMargin

online collaborative annotation

Developed by the Research and Development Unit for English Studies at Birmingham City University, eMargin is an online collaborative annotation tool that lets you highlight, colour-code, write notes and assign tags to individual words or passages of a text. These annotations can be shared amongst groups, generating discussions and allowing analyses and interpretations to be combined.

Downloads: 0 This Week

Last Update: 2018-10-02
12

Corpus Toolkit

A text management tool for linguistic purposes...

Downloads: 0 This Week

Last Update: 2017-11-23
13

English-Vietnamese Bilingual Corpus

The English-Vietnamese Bilingual Corpus (EVBCorpus) is a collection of English and Vietnamese parallel translations and bitexts.

Downloads: 0 This Week

Last Update: 2016-09-01
14

texrex

Web corpus creation software (moved to GitHub)

This project has moved to GitHub: https://github.com/rsling/texrex https://github.com/rsling/cow

Downloads: 0 This Week

Last Update: 2016-04-20
15

ICE Nigeria

Nigerian component of the International Corpus of English

This is the Nigerian component of the International Corpus of English, a one million word corpus of written and spoken Nigerian English for linguistic research. It can be used as a stand-alone corpus or in conjunction with other components of the International Corpus of English (such as ICE-GB, ICE-India, etc.) to compare international varieties of English. This is the first release of the complete corpus. ...

1 Review

Downloads: 6 This Week

Last Update: 2015-11-03
16

Cross-Language Computational Linguistics

Cross-language resources

The AFEWC corpus is a multilingual collection of comparable articles in Arabic, French, and English. Each article triple covers the same topic (aligned at article level). The AFEWC corpus was collected from Wikipedia. The corpus is available for free for research purposes only. It is composed of 40K aligned articles, 91.3M English words, 57.8M French words, 22M Arabic words, 2.8M English unique words, 1.9M French unique words, and 1.5M Arabic unique words. ...

Downloads: 0 This Week

Last Update: 2015-09-11
17

Osman Arabic Text Readability

Open Source tool for Arabic text readability

We present OSMAN (Open Source Metric for Measuring Arabic Narratives), a novel open source Arabic readability metric and tool. The open source Java tool allows users to calculate readability for Arabic text (with and without diacritics). The tool provides methods to split the text into words and sentences and to count syllables, Faseeh letters, and hard and complex words, in addition to adding diacritics (vocalising the text). This makes the tool useful for researchers and educators working with Arabic text....

Downloads: 0 This Week

Last Update: 2016-11-17
18

mwetoolkit

THIS PROJECT MIGRATED TO https://gitlab.com/mwetoolkit/mwetoolkit3/

The Multiword Expressions toolkit aids in the automatic identification and extraction of multiword units in running text. These include idioms (kick the bucket), noun compounds (cable car), phrasal verbs (take off, give up), etc. Even though it focuses on multiword expressions, the framework is quite complete and can also be useful in any corpus-based study in computational linguistics. The mwetoolkit can be...

1 Review

Downloads: 0 This Week

Last Update: 2019-05-01
19

Natural Language Analysis with Ngrams

NLP tool for statistical analysis of words, sentences, documents

...In future versions, users will be able to convert a single word to numerical data, compare two words and get the comparison data, and do the same for sentences, paragraphs and documents. I will package it as a JAR once I decide that it can be called a final release. This project was made by creating a corpus from the Google Ngrams data for the English language, version 20120701. The EOWL list of English words was used to filter the words from the Ngrams data. For each year, per word, the data was summed and used to compute the average number of appearances of a word per document for that year. Before using this program, you MUST download the corpus.

Downloads: 0 This Week

Last Update: 2015-02-01
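
The corpus-building step described for this project can be sketched as follows. The sketch assumes the tab-separated 1-gram layout of the Google Ngrams version 20120701 files (ngram, year, match count, volume count) and assumes that "average appearance per document" means match count divided by volume count; the project's actual aggregation is not specified in this listing. It also assumes the word list is used to keep matching words. The function name and the sample rows are made up for illustration.

```python
import csv
import io


def average_per_document(lines, allowed_words):
    """Map (word, year) to match_count / volume_count for allowed words.

    lines: an iterable of tab-separated rows
           (ngram, year, match_count, volume_count).
    allowed_words: a set of words to keep (a stand-in for the EOWL list).
    """
    averages = {}
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for ngram, year, match_count, volume_count in reader:
        volumes = int(volume_count)
        if ngram in allowed_words and volumes > 0:
            averages[(ngram, int(year))] = int(match_count) / volumes
    return averages


# Invented rows in the assumed layout, not real Ngrams figures.
sample = io.StringIO(
    "corpus\t1990\t300\t100\n"
    "corpus\t1991\t450\t150\n"
    "xyzzy\t1990\t5\t5\n"
)
averages = average_per_document(sample, {"corpus"})
# {('corpus', 1990): 3.0, ('corpus', 1991): 3.0}
```

For the real data files, `lines` could be an open text file handle instead of the in-memory sample.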
20

DisMo

A POS, disfluency and multi-word unit annotator for spoken language

DisMo is a part-of-speech, disfluency and multi-word unit automatic annotator. It is designed to manage the complexities and phenomena specific to spoken language. It currently supports English and French, with support for more languages coming soon. It is developed and maintained by George Christodoulides (Centre Valibel, IL&C, University of Louvain, Louvain-la-Neuve, Belgium). Visit www.corpusannotation.org to find out more about DisMo and other annotation tools for language corpora. If you are using DisMo to annotate your corpus, please cite the following paper: Christodoulides, George; Avanzi, Mathieu; Goldman, Jean-Philippe. ...

Downloads: 0 This Week

Last Update: 2014-10-23
21

LeaP corpus

A phonological corpus of learner English and learner German

The LeaP corpus is a phonologically annotated corpus that comprises spoken language produced by 46 learners of English and 55 learners of German as well as recordings with 4 native speakers of English and 7 native speakers of German. In total, it consists of 12 hours of speech and was collected at the University of Bielefeld (Germany) between 2001 and 2003 as part of the LeaP (Learning Prosody in a Foreign Language) project, which investigated the acquisition of prosody by second language learners of German and English with special focus on stress, intonation, and speech rhythm as well as influencing factors on the acquisition process and outcome.

Downloads: 1 This Week

Last Update: 2015-10-05
22

Transml

Phrase-based Statistical Machine Translation system for English Language

This software translates English to Malayalam and vice versa. Statistical Machine Translation (SMT) is a machine translation paradigm where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. SMT is a corpus-based approach, in which a massive parallel corpus is required to train the SMT system.

Downloads: 0 This Week

Last Update: 2013-10-03
23

CorpusSearch

CorpusSearch finds syntactic structures in a corpus of annotated sentence trees. It can be used as a research tool on a corpus, or as a development tool for building the corpus.

Downloads: 60 This Week

Last Update: 2013-06-26
24

zkanji - Japanese Language Study Suite

Japanese vocabulary and kanji study tool with built-in dictionary

zkanji is a feature-rich Japanese language study suite and dictionary for Windows. It has several kanji look-up methods, optional example sentences for many Japanese words, vocabulary printing, JLPT levels indicated for words and kanji for all N levels, a spaced-repetition system for studying, and more. Visit http://zkanji.sourceforge.net for details

14 Reviews

Downloads: 31 This Week

Last Update: 2013-08-08
25

Khmer Automatic Translation

Khmer-English-Khmer Automatic Translation

The project attempts to develop a hybrid, high-quality English-Khmer-English automatic translation system that is based on a parallel corpus and statistical analysis and is enhanced with part-of-speech analysis.

Downloads: 1 This Week

Last Update: 2013-05-29