Editing Large Text Files
For big data editing, besides Python data cleaning, here is my recommendation:
Cloud Monk's review of EmEditor, a buggy and overly complicated text editor to AVOID.
SO BUGGY!!! I no longer recommend this product due to numerous keyboard-shortcut bugs that the author refuses to fix, even after I spent five hours documenting them across several emails. His English is very poor, so he does not understand my reports and then asks me to re-explain them differently. Ugh!
It is fine for mouse-only use, but if you rely on the keyboard and the standard Windows editing keyboard shortcuts, you will be very frustrated.
Yutaka Emura is the creator of this overly complicated text editor. I highly recommend AVOIDING it if you use keyboard shortcuts instead of constantly reaching for the mouse.
Notepad Plus Plus is FAR superior.
https://stackoverflow.com/questions/159521/text-editor-to-open-big-giant-huge-large-text-files
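If no editor handles your file, a script often will. This is a minimal sketch (the function name and parameters are my own, for illustration) of a streamed search-and-replace in Python that edits a multi-gigabyte text file line by line, so memory use stays flat regardless of file size:

```python
import os
import tempfile

def replace_in_huge_file(path, old, new, encoding="utf-8"):
    """Stream a large text file line by line, writing the edited copy to a
    temp file in the same directory, then swap it over the original.
    Only one line is ever held in memory at a time."""
    dir_name = os.path.dirname(os.path.abspath(path))
    with open(path, "r", encoding=encoding) as src, \
         tempfile.NamedTemporaryFile("w", encoding=encoding,
                                     dir=dir_name, delete=False) as dst:
        for line in src:
            dst.write(line.replace(old, new))
        tmp_name = dst.name
    os.replace(tmp_name, path)  # atomic rename on POSIX and Windows
```

Writing the temp file to the same directory as the original matters: `os.replace` can only rename atomically within one filesystem.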
The developer, Yutaka Emura, has a years-long pattern of introducing bugs, fixing them, and then reintroducing the same bugs.
OLD REVIEW:
- “EmEditor is so good, I would gladly pay $300 per year for it. People pay Microsoft that much for MS Word. Having tested more than 30 different editors, I can say it is the fastest text editor on the planet! I use it for editing massively large text files for big data (huge data) and data science since it has extremely fast multi-threaded search and replace that allows me to use only keyboard shortcuts rather than the mouse. Due to the multi-threaded superfast saves I can quickly get back to editing.”
- Snippet from Wikipedia: Data cleansing
Data cleansing or data cleaning is the process of identifying and correcting (or removing) corrupt, inaccurate, or irrelevant records from a dataset, table, or database. It involves detecting incomplete, incorrect, or inaccurate parts of the data and then replacing, modifying, or deleting the affected data. Data cleansing can be performed interactively using data wrangling tools, or through batch processing often via scripts or a data quality firewall.
After cleansing, a data set should be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores. Data cleaning differs from data validation in that validation almost invariably means data is rejected from the system at entry and is performed at the time of entry, rather than on batches of data.
The actual process of data cleansing may involve removing typographical errors or validating and correcting values against a known list of entities. The validation may be strict (such as rejecting any address that does not have a valid postal code), or with fuzzy or approximate string matching (such as correcting records that partially match existing, known records). Some data cleansing solutions will clean data by cross-checking with a validated data set. A common data cleansing practice is data enhancement, where data is made more complete by adding related information. For example, appending addresses with any phone numbers related to that address. Data cleansing may also involve harmonization (or normalization) of data, which is the process of bringing together data of "varying file formats, naming conventions, and columns", and transforming it into one cohesive data set; a simple example is the expansion of abbreviations ("st, rd, etc." to "street, road, etcetera").
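The steps the snippet describes (strict validation against a rule, plus harmonization of abbreviations) can be sketched in a few lines of Python. The abbreviation map and the 5-digit postal-code rule below are illustrative assumptions, not part of any real standard:

```python
import csv
import io
import re

# Hypothetical harmonization table and validation rule, for illustration only.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}
POSTAL_CODE = re.compile(r"^\d{5}$")  # strict US-style 5-digit rule

def expand_abbreviations(address):
    """Harmonize address text: expand abbreviations like 'st' -> 'street'."""
    words = address.lower().split()
    return " ".join(ABBREVIATIONS.get(w.rstrip("."), w) for w in words)

def clean_rows(rows):
    """Yield cleaned rows; strict validation drops rows with bad postal codes."""
    for row in rows:
        if not POSTAL_CODE.match(row["zip"].strip()):
            continue  # strict validation: reject the record entirely
        row["address"] = expand_abbreviations(row["address"])
        yield row

raw = "address,zip\n12 Main St.,02134\n9 Elm Rd,ABCDE\n"
cleaned = list(clean_rows(csv.DictReader(io.StringIO(raw))))
# cleaned -> [{'address': '12 main street', 'zip': '02134'}]
```

Fuzzy matching against known records (the other approach the snippet mentions) would replace the strict regex with an approximate comparison, e.g. via `difflib.get_close_matches` from the standard library.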
Research More
Fair Use Sources
Data Science: Fundamentals of Data Science, DataOps, Big Data, Data Science IDEs (Jupyter Notebook, JetBrains DataGrip, Google Colab, JetBrains DataSpell, SQL Server Management Studio, MySQL Workbench, Oracle SQL Developer, SQLiteStudio), Data Science Tools (SQL, Apache Arrow, Pandas, NumPy, Dask, Spark, Kafka); Data Science Programming Languages (Python Data Science, NumPy Data Science, R Data Science, Java Data Science, C++ Data Science, MATLAB Data Science, Scala Data Science, Julia Data Science, Excel Data Science (Excel is the most popular "programming language") - Google Sheets, SAS Data Science, C# Data Science, Golang Data Science, JavaScript Data Science, Kotlin Data Science, Ruby Data Science, Rust Data Science, Swift Data Science, TypeScript Data Science, Bash Data Science); Databases, Data, Augmentation, Analysis, Analytics, Archaeology, Cleansing, Collection, Compression, Corruption, Curation, Degradation, Editing (EmEditor), Data engineering, ETL/ ELT ( Extract- Transform- Load), Farming, Format management, Fusion, Integration, Integrity, Lake, Library, Loss, Management, Migration, Mining, Pre-processing, Preservation, Protection (privacy), Recovery, Reduction, Retention, Quality, Science, Scraping, Scrubbing, Security, Stewardship, Storage, Validation, Warehouse, Wrangling/munging. ML-DL - MLOps. Data science history, Data Science Bibliography, Manning Data Science Series, Data science Glossary, Data science topics, Data science courses, Data science libraries, Data science frameworks, Data science GitHub, Data Science Awesome list. (navbar_datascience - see also navbar_python, navbar_numpy, navbar_data_engineering and navbar_database)
© 1994 - 2024 Cloud Monk Losang Jinpa or Fair Use. Disclaimers
SYI LU SENG E MU CHYWE YE. NAN. WEI LA YE. WEI LA YE. SA WA HE.